Python's robotparser ignores Sitemaps

I have the following robots.txt:

User-agent: * 
Disallow: /images/ 
Sitemap: http://www.example.com/sitemap.xml

and the following robotparser code:

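import robotparser  # Python 2 module (urllib.robotparser in Python 3)
import urlparse     # Python 2 module (urllib.parse in Python 3)

def init_robot_parser(URL):
    robot_parser = robotparser.RobotFileParser()
    # Resolve robots.txt relative to URL, then fetch and parse it
    robot_parser.set_url(urlparse.urljoin(URL, "robots.txt"))
    robot_parser.read()

    return robot_parser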

But when I add print robot_parser just above return robot_parser, all I get is:

User-agent: * 
Disallow: /images/

Why does it ignore the Sitemap line? What am I missing?


2010-06-04 Ben

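Sitemap is an extension to the standard, and robotparser doesn't support it. You can see in the source that it only handles "user-agent", "disallow" and "allow". For what it currently does (telling you whether a particular URL is allowed), understanding Sitemap isn't necessary.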
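For readers on a newer interpreter: Python 3.8 added a site_maps() method to urllib.robotparser.RobotFileParser, so the standard library can now report Sitemap lines. This does not apply to the Python 2 robotparser module used in the question. A minimal sketch, feeding the robots.txt from the question straight to parse() so that it runs without network access:

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the question, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
Sitemap: http://www.example.com/sitemap.xml
"""

robot_parser = RobotFileParser()
robot_parser.parse(ROBOTS_TXT.splitlines())

# site_maps() requires Python 3.8+; it returns None when no Sitemap line exists.
sitemaps = robot_parser.site_maps()
print(sitemaps)

# The allow/disallow rules remain available as before.
print(robot_parser.can_fetch("*", "http://www.example.com/images/logo.png"))
```

On older versions the attribute simply does not exist, so reading the file yourself is still the fallback there.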


2010-06-04 22:16:36

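Yes, but I need to see whether any sitemaps are specified so that I can parse them. I guess I'll just have to open the robots.txt with urlopen. Thanks. – Ben 2010-06-04 22:29:04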
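The urlopen approach mentioned in this comment could look roughly like the sketch below. It is written for Python 3 (urllib.request and urllib.parse rather than the Python 2 modules in the question), and the helper names extract_sitemaps and fetch_sitemaps are made up for illustration. It also assumes that anything after a "#" is a comment, which would truncate a sitemap URL containing a fragment.

```python
from urllib.parse import urljoin
from urllib.request import urlopen


def extract_sitemaps(robots_text):
    """Return the URLs from every Sitemap line in a robots.txt body."""
    sitemaps = []
    for line in robots_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep and key.strip().lower() == "sitemap" and value.strip():
            sitemaps.append(value.strip())
    return sitemaps


def fetch_sitemaps(site_url):
    """Download robots.txt relative to site_url and list its sitemaps (needs network)."""
    robots_url = urljoin(site_url, "robots.txt")
    with urlopen(robots_url) as response:
        text = response.read().decode("utf-8", errors="replace")
    return extract_sitemaps(text)
```

Calling fetch_sitemaps("http://www.example.com/") would then return a list of sitemap URLs, or an empty list if robots.txt declares none. The parsing is kept in its own function so it can be checked against a string without touching the network.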

You can use reppy (https://github.com/seomoz/reppy) to parse robots.txt; it handles sitemaps.

Keep in mind that in some cases there is a sitemap at a default location (/sitemaps.xml) that the site owner hasn't mentioned in robots.txt (for example, on toucharcade.com).

I also found at least one site whose sitemap is compressed; that is, its robots.txt points to a .gz file.
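To cope with the compressed case, one option is to look at the first two bytes of the downloaded payload instead of trusting the file extension. The following is only a sketch in Python 3. The function names are hypothetical, and it assumes the sitemap bytes have already been downloaded (for example with urlopen).

```python
import gzip
import xml.etree.ElementTree as ET

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def sitemap_bytes_to_xml(data):
    """Decompress the payload if it is gzip, otherwise return it unchanged."""
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data


def sitemap_locations(data):
    """Return the text of every <loc> element in a (possibly gzipped) sitemap."""
    root = ET.fromstring(sitemap_bytes_to_xml(data))
    # Sitemap files declare an XML namespace, so compare the local tag name only.
    return [
        element.text.strip()
        for element in root.iter()
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text
    ]
```

The same function then works for both sitemap.xml and sitemap.xml.gz. For a sitemap index file, the returned locations are themselves sitemap URLs that would need to be fetched in turn.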


2013-08-12 10:33:03 kolinko
