Scrapy cannot crawl craigslist

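This same code crawls the Yellow Pages without any problem and works as expected. After changing the rule for CL, it hits the first URL and then fizzles out with no relevant output.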

from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.selector import HtmlXPathSelector 
from craigs.items import CraigsItem 

class MySpider(CrawlSpider): 
     name = "craigs" 
     allowed_domains = ["craiglist.org"] 

     start_urls = ["http://newyork.craigslist.org/cpg/"] 

     rules = [Rule(SgmlLinkExtractor(restrict_xpaths=('/html/body/blockquote[3]/p/a',)), follow=True, callback='parse_profile')] 

     def parse_profile(self, response): 
       found = [] 
       img = CraigsItem() 
       hxs = HtmlXPathSelector(response) 
       img['title'] = hxs.select('//h2[contains(@class, "postingtitle")]/text()').extract() 
       img['text'] = hxs.select('//section[contains(@id, "postingbody")]/text()').extract() 
       img['tags'] = hxs.select('//html/body/article/section/section[2]/section[2]/ul/li[1]').extract() 

       print found[0] 
       return found[0]

Here is the output: http://pastie.org/6087878. As you can see, it has no problem getting the first URL to crawl, http://newyork.craigslist.org/mnh/cpg/3600242403.html, but then it dies. From the CLI I can dump all the links, either with XPaths, SgmlLinkExtractor(restrict_xpaths=('/html/body/blockquote[3]/p/a',)).extract_links(response), or by keyword, SgmlLinkExtractor(allow=r'/cpg/.+').extract_links(response).
Output: http://pastie.org/6085322

But in the crawl, the same query fails. WTF?


2013-02-07 user1544207

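I took tcpdump packet captures to compare the CLI operation, which *does* return data, with the crawl operation, which does *not*. You can clearly see that craigslist *is* handing over the data. After the GET, I receive a large amount of HTTP data from the URL, but scrapy does nothing with it during the crawl. This looks 100% like an error on the scrapy (user) side. I am stuck. http://tinypic.com/r/289lw9/6 http://tinypic.com/r/2dugcj/6 – user1544207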

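The script runs and performs the first GET of the start URL, and receives data from CL as expected. However, despite what scrapy's SgmlLinkExtractor reports in the debug output, a second GET request never even leaves the NIC. "[craigs] DEBUG: Filtered offsite request to 'newyork.craigslist.org':" is what Scrapy says, but tcpdump tells a different story. – user1544207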

My parse *does* work. But the data collection never executes? – user1544207

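If you look at the documentation, you will see:

allowed_domains: An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list won't be followed if OffsiteMiddleware is enabled.

Your allowed domain is

allowed_domains = ["craiglist.org"]

but you are trying to fetch a subdomain: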

02-07 15:39:03+0000 [craigs] DEBUG: Filtered offsite request to 'newyork.craigslist.org': <GET http://newyork.craigslist.org/mnh/cpg/3600242403.html>

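That is why it is filtered. Note that the listed domain is spelled craiglist.org, while the site is craigslist.org, so the host newyork.craigslist.org does not match it.

Either remove allowed_domains from your crawler or add the proper domain names to it to avoid filtered offsite requests.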
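
As a rough illustration of why the request to newyork.craigslist.org gets filtered, here is a simplified, standard-library sketch of this kind of host check. It is not Scrapy's actual implementation; the helper names build_host_regex and is_offsite are made up for this example, and it assumes a host passes when it equals an allowed domain or is a subdomain of one.

```python
import re
from urllib.parse import urlparse


def build_host_regex(allowed_domains):
    """Compile a pattern matching each allowed domain and any of its subdomains."""
    escaped = "|".join(re.escape(domain) for domain in allowed_domains)
    return re.compile(r"^(.*\.)?(%s)$" % escaped)


def is_offsite(url, allowed_domains):
    """Return True if the URL's host falls outside the allowed domains."""
    host = urlparse(url).hostname or ""
    return build_host_regex(allowed_domains).search(host) is None


url = "http://newyork.craigslist.org/mnh/cpg/3600242403.html"

# Misspelled domain from the question: the host does not match, so it is offsite.
print(is_offsite(url, ["craiglist.org"]))

# Corrected spelling: the host is a subdomain of an allowed domain.
print(is_offsite(url, ["craigslist.org"]))
```

Under these assumptions the first call reports the URL as offsite and the second does not, which suggests that correcting the spelling in allowed_domains should work as well as removing the attribute.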


2013-02-08 07:12:51

Thanks! Removed the allowed domains. – user1544207

@user1544207 Don't forget to accept the answer so that others can benefit from this post (you can accept the answer by clicking the tick mark below the upvote sign) –
