Q

Using Python robotparser

2011-10-05 62 views 0 likes

0

I don't understand how to use the parse function in the robotparser module. Here is what I tried:

In [28]: rp.set_url("http://anilattech.wordpress.com/robots.txt") 

In [29]: rp.parse("""# If you are regularly crawling WordPress.com sites please use our firehose to receive real-time push updates instead. 
# Please see http://en.wordpress.com/firehose/ for more details. 
Sitemap: http://anilattech.wordpress.com/sitemap.xml 
User-agent: IRLbot 
Crawl-delay: 3600 
User-agent: * 
Disallow: /next/ 
# har har 
User-agent: * 
Disallow: /activate/ 
User-agent: * 
Disallow: /signup/ 
User-agent: * 
Disallow: /related-tags.php 
# MT refugees 
User-agent: * 
Disallow: /cgi-bin/ 
User-agent: * 
Disallow:""") 

In [48]: rp.can_fetch("*","http://anilattech.wordpress.com/signup/") 
Out[48]: True

It looks like rp.entries is []. I don't understand what is wrong. I tried simpler examples too, but got the same result.

2011-10-05 Anil Shanbhag

A

Answers

1

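There are two problems here. First, the rp.parse method expects a list of strings, so you should add .split("\n") to that line.

The second problem is that the rules for the * user agent are stored in rp.default_entry rather than rp.entries. If you inspect it, you will see that it contains an Entry object.

I'm not sure who is at fault here, but Python's implementation of the parser only respects the first User-agent: * section, so in the example given only /next/ is disallowed. The other Disallow lines are ignored. I haven't read the specification, so I can't say whether this is a malformed robots.txt file or whether the Python code is wrong. I would assume the former.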
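As a minimal sketch of the first two points, the snippet below parses the same small rule set twice, once as a raw string and once as a list of lines. It assumes Python 3, where the module lives at urllib.robotparser (the question uses the Python 2 robotparser module), and the two-line robots_txt is a shortened stand-in for the file in the question, not the original.

```python
from urllib.robotparser import RobotFileParser  # Python 2: import robotparser

# Shortened stand-in for the robots.txt in the question (an assumption).
robots_txt = """\
User-agent: *
Disallow: /next/
"""

url = "http://anilattech.wordpress.com/next/"

# Raw string: parse() iterates over single characters, so no rule is recognised.
rp_wrong = RobotFileParser()
rp_wrong.parse(robots_txt)
allowed_wrong = rp_wrong.can_fetch("*", url)

# List of lines: the Disallow rule is picked up.
rp_right = RobotFileParser()
rp_right.parse(robots_txt.splitlines())
allowed_right = rp_right.can_fetch("*", url)

print(allowed_wrong, allowed_right)   # True False
# Rules for "*" land in default_entry, so entries stays empty.
print(rp_right.entries, rp_right.default_entry is not None)  # [] True
```

Note that entries and default_entry are implementation attributes rather than documented API, so they are only useful here for inspection.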

2011-10-05 09:38:36

0

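Well, I just found the answer.

1. The robots.txt in question [from wordpress.com] contains multiple User-agent declarations. The robotparser module does not support this. Removing the redundant User-agent: * lines solved the problem.

2. The argument to parse must be a list, as Andrew pointed out.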
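The workaround in point 1 could be automated roughly as follows. This is only a sketch: drop_repeated_wildcard_agents is a hypothetical helper, not part of robotparser, and it assumes the User-agent: * groups are adjacent, with no blank lines or other user-agent groups between them, as in the file from the question. It again assumes Python 3's urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser  # Python 2: import robotparser


def drop_repeated_wildcard_agents(lines):
    """Hypothetical helper: keep only the first 'User-agent: *' line.

    The Disallow lines of later wildcard groups then fall into the
    first group. Only valid when the wildcard groups are adjacent.
    """
    seen_wildcard = False
    kept = []
    for line in lines:
        field, _, value = line.partition(":")
        is_wildcard = (field.strip().lower() == "user-agent"
                       and value.strip() == "*")
        if is_wildcard:
            if seen_wildcard:
                continue  # skip the repeated declaration
            seen_wildcard = True
        kept.append(line)
    return kept


# Shortened stand-in for the robots.txt in the question (an assumption).
robots_txt = """\
User-agent: *
Disallow: /next/
User-agent: *
Disallow: /signup/
"""

rp = RobotFileParser()
rp.parse(drop_repeated_wildcard_agents(robots_txt.splitlines()))

signup_allowed = rp.can_fetch("*", "http://anilattech.wordpress.com/signup/")
next_allowed = rp.can_fetch("*", "http://anilattech.wordpress.com/next/")
print(signup_allowed, next_allowed)  # False False
```

For files where the wildcard groups are separated by blank lines or by other groups, the rules would have to be collected and re-emitted as a single group instead.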

2011-10-05 16:59:19
