Adapting Typecho's Sitemap plugin for the Google Search Console crawler (changing the article-page lastmod from date and time to date only)

Kris

April 5, 2022


Sitemap plugin used: typechoSitemap · shiyueGG (github.com)

The problem

The sitemap exported by the Sitemap plugin looks like this (excerpt):

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:mobile="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc> https://www.9kr.cc/category/default/</loc>
<lastmod> 2022-04-05 </lastmod>
<changefreq> always </changefreq>
<priority> 0.9 </priority>
</url>
<url>
<loc> https://www.9kr.cc/archives/47/</loc>
<lastmod> 2022-04-05 04:54:01 </lastmod>
<changefreq> weekly </changefreq>
<priority> 0.9 </priority>
</url>
<url>
<loc> https://www.9kr.cc/tag/Nginx/</loc>
<lastmod> 2022-04-05 </lastmod>
<changefreq> always </changefreq>
<priority> 0.8 </priority>
</url>
<url>
<loc> https://www.9kr.cc/links.html</loc>
<lastmod> 2022-04-05 </lastmod>
<changefreq> monthly </changefreq>
<priority> 0.8 </priority>
</url>
</urlset>

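As you can see, the lastmod tag of the article page contains both a date and a time, while the other entries contain only a date.

From my troubleshooting, Google's crawler reports an error when a lastmod value in the sitemap has this date-and-time form. The sitemap protocol expects lastmod in W3C Datetime format: a bare YYYY-MM-DD is valid, but a time has to be written as YYYY-MM-DDThh:mm:ss followed by a time-zone designator, so a value such as 2022-04-05 04:54:01 (space separator, no time zone) is not valid.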
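To see which entries are affected before changing anything, a check along the following lines may help. This is only a minimal sketch: the function name find_datetime_lastmods is hypothetical, and it uses the standard library's xml.etree.ElementTree on an inline sample rather than the live sitemap.

```python
import re
import xml.etree.ElementTree as ET

# Namespace declared by the sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# A bare calendar date such as 2022-04-05.
DATE_ONLY = re.compile(r"\d{4}-\d{2}-\d{2}")


def find_datetime_lastmods(sitemap_xml):
    """Return (loc, lastmod) pairs whose lastmod is not a bare YYYY-MM-DD date."""
    root = ET.fromstring(sitemap_xml)
    offenders = []
    for url in root.findall("sm:url", SITEMAP_NS):
        # The plugin pads values with spaces, so strip before comparing.
        loc = url.findtext("sm:loc", "", SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", "", SITEMAP_NS).strip()
        if not DATE_ONLY.fullmatch(lastmod):
            offenders.append((loc, lastmod))
    return offenders


# Two entries taken from the excerpt above.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc> https://www.9kr.cc/category/default/</loc><lastmod> 2022-04-05 </lastmod></url>
<url><loc> https://www.9kr.cc/archives/47/</loc><lastmod> 2022-04-05 04:54:01 </lastmod></url>
</urlset>"""

print(find_datetime_lastmods(sample))
# [('https://www.9kr.cc/archives/47/', '2022-04-05 04:54:01')]
```

Only the article entry is reported, which matches what the excerpt shows.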

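The solution

Because I am not familiar with how Typecho plugins are written, I did not touch the plugin itself and used a workaround instead.

A script periodically fetches the site's sitemap.xml, reads its content, rewrites every lastmod tag that contains a date and time, and writes the result to a new file.

The sitemap URL submitted to Google then points directly at this new file.

Writing the code

The converter here is written in Python. The converted map.xml is placed in the site's root directory, so it can be accessed directly at https://[your domain]/map.xml (adjust as needed for your own setup).

Run it in the background with nohup. In my testing, Google Search Console recognized the converted sitemap without reporting errors.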

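# Written 2022-04-05
# The sitemap built by Typecho's sitemap plugin uses a lastmod format that Google's crawler does not accept.
# Article entries use 20xx-xx-xx xx:xx:xx and the others use 20xx-xx-xx, while Google expects 20xx-xx-xx.

import datetime
import time
from xml.sax.saxutils import escape

import requests
from bs4 import BeautifulSoup

# Adapt the sitemap for Google's crawler
def sitemap_DateTime2Date(nURL):
    r = requests.get(nURL, timeout=30)
    # Raise on an HTTP error instead of producing an empty sitemap
    r.raise_for_status()
    # XML string to return, starting with the XML declaration and the opening urlset tag
    rDat = '<?xml version="1.0" encoding="utf-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:mobile="http://www.sitemaps.org/schemas/sitemap/0.9">'
    # Pass the raw bytes so the parser detects the encoding itself
    xmlDat = BeautifulSoup(r.content, 'xml')
    # Take the url nodes one by one
    for nn in xmlDat.find_all('url'):
        # Drop the padding spaces, then keep only the leading 20xx-xx-xx part;
        # this cuts off the time when lastmod is 20xx-xx-xx xx:xx:xx
        lastmod = nn.lastmod.string.strip()[:10]
        # Re-escape the URL, because the parser has decoded entities such as &amp;
        loc = escape(nn.loc.string.strip())
        # Build the processed node
        nStr = '<url><loc>' + loc + '</loc><lastmod>' + lastmod + '</lastmod>'
        nStr = nStr + '<changefreq>' + nn.changefreq.string.strip() + '</changefreq><priority>' + nn.priority.string.strip() + '</priority></url>'
        # Append the node to the XML string to return
        rDat = rDat + nStr
    # Append the closing urlset tag
    rDat = rDat + '</urlset>'
    return rDat

while True:
    # Sitemap URL
    url = 'https://www.9kr.cc/sitemap.xml'
    # Where the processed sitemap is stored
    filePath = '/www/map.xml'
    # Fetch and convert first, so a failed request does not truncate the existing file
    newMap = sitemap_DateTime2Date(url)
    # Write the new file
    with open(filePath, 'w', encoding='utf-8') as f:
        f.write(newMap)
    # Current date and time with the fractional seconds cut off; can be written to a log file
    timeStr = str(datetime.datetime.now())[:19]
    print(timeStr)
    # Wait one hour; alternatively, drop the loop and use a Linux scheduled task (cron)
    time.sleep(3600)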
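
Whether the script runs in a loop or from a scheduled task, the web server may serve map.xml while it is being rewritten. One possible way to avoid a crawler ever seeing a half-written file is to write to a temporary file and rename it into place. The sketch below is an assumption on my part rather than part of the original script; write_atomically is a hypothetical helper, and the 0o644 permission is a guess at what a typical web server needs in order to read the file.

```python
import os
import tempfile


def write_atomically(path, text):
    """Write text to path so that readers never see a partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(text)
        # mkstemp creates owner-only files; assumed here: the web server needs read access.
        os.chmod(tmp_path, 0o644)
        # os.replace swaps the new file in as a single step, overwriting the old one.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

In the script above, the file-writing step would then become write_atomically(filePath, sitemap_DateTime2Date(url)).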

Possible problems and fixes

Parser library error

Error

bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: xml. Do you need to install a parser library?

Fix

Installing the lxml library resolves it:

pip install lxml
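
If installing lxml is not an option, the same lastmod conversion can probably be done with the standard library alone. The following is a rough sketch under that assumption, not the approach used above: lastmod_to_date is a hypothetical name, and it works on XML text that has already been fetched.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def lastmod_to_date(sitemap_xml):
    """Return the sitemap with every lastmod truncated to YYYY-MM-DD."""
    # Serialize sitemap elements with a default namespace instead of an ns0: prefix.
    ET.register_namespace("", SITEMAP_NS)
    root = ET.fromstring(sitemap_xml)
    for lastmod in root.iter("{%s}lastmod" % SITEMAP_NS):
        if lastmod.text:
            # Strip the padding spaces, then keep only the date part.
            lastmod.text = lastmod.text.strip()[:10]
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="utf-8"?>\n' + body


sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://www.9kr.cc/archives/47/</loc>"
    "<lastmod> 2022-04-05 04:54:01 </lastmod></url></urlset>"
)
print(lastmod_to_date(sample))
```

Unlike the string-building version, this keeps every other element of each entry untouched and lets the serializer handle escaping.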
