2014-02-27 · 17 votes · 64 views

How do Scrapy Rules work with CrawlSpider?

I'm having a hard time understanding Scrapy's CrawlSpider rules. I have an example that doesn't work the way I'd like it to, so it must be one of two things:

  1. I don't understand how the rules work.
  2. I've formed an incorrect regex, which prevents me from getting the results I need.

OK, here is what I want to do:

I want to write a crawl spider that gets all the available statistics from the http://www.euroleague.net website. The site page that hosts all the information I need is here.

Step 1

I figured my first step would be to extract the "Seasons" link(s) and follow them. Here is the HTML/href I intend to match (I want to match all the links in the "Seasons" section one by one, but I think it will be easier to take one link as an example):

href="/main/results/by-date?seasoncode=E2001" 

And here is the rule/regex I created for it:

Rule(SgmlLinkExtractor(allow=('by-date\?seasoncode\=E\d+',)),follow=True), 


Step 2

When the spider takes me to the page http://www.euroleague.net/main/results/by-date?seasoncode=E2001 for the second step, I want it to extract the links from the "Regular Season" section. In this case, let's say it should be "Round 1". The HTML/href I'm looking for is:

<a href="/main/results/by-date?seasoncode=E2001&gamenumber=1&phasetypecode=RS" 

and the rule/regex I constructed for it is:

Rule(SgmlLinkExtractor(allow=('seasoncode\=E\d+\&gamenumber\=\d+\&phasetypecode\=\w+',)),follow=True), 


Step 3

Now that I've reached the page (http://www.euroleague.net/main/results/by-date?seasoncode=E2001&gamenumber=1&phasetypecode=RS), I'm ready to extract the links that lead to the pages holding all the information I need. The HTML/href I'm looking for is:

href="/main/results/showgame?gamenumber=1&phasetypecode=RS&gamecode=4&seasoncode=E2001#!boxscore" 

And the rule/regex it must follow would be:

Rule(SgmlLinkExtractor(allow=('gamenumber\=\d+\&phasetypecode\=\w+\&gamecode\=\d+\&seasoncode\=E\d+',)),callback='parse_item'), 
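As a quick sanity check (done with plain `re.search`, which is how a link extractor's allow patterns are applied to URLs), the three patterns above do match their intended sample hrefs:

```python
import re

# (sample href from the page, allow pattern from the corresponding Rule)
hrefs_and_patterns = [
    ("/main/results/by-date?seasoncode=E2001",
     r'by-date\?seasoncode\=E\d+'),
    ("/main/results/by-date?seasoncode=E2001&gamenumber=1&phasetypecode=RS",
     r'seasoncode\=E\d+\&gamenumber\=\d+\&phasetypecode\=\w+'),
    ("/main/results/showgame?gamenumber=1&phasetypecode=RS&gamecode=4&seasoncode=E2001#!boxscore",
     r'gamenumber\=\d+\&phasetypecode\=\w+\&gamecode\=\d+\&seasoncode\=E\d+'),
]

for href, pattern in hrefs_and_patterns:
    print(bool(re.search(pattern, href)))  # True, three times
```

So the patterns themselves are fine against these hand-picked hrefs; the question is what the site actually links to, which the crawl log further down shows.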


The problem

I think the crawler should work like this: the rules are something like a loop. When the first link is matched, the crawler will follow it to the "Step 2" page, then to "Step 3", and after that it will extract the data. After doing that, it will return to "Step 1" to match the second link, and start the loop again until there are no links left in the first step.

What I see from the terminal is that the crawler seems to loop only within "Step 1". It iterates through all the "Step 1" links, but never gets into the "Step 2"/"Step 3" rules.

2014-02-28 00:20:31+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2000> (referer: http://www.euroleague.net/main/results/by-date) 
2014-02-28 00:20:31+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2001> (referer: http://www.euroleague.net/main/results/by-date) 
2014-02-28 00:20:31+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2002> (referer: http://www.euroleague.net/main/results/by-date) 
2014-02-28 00:20:32+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2003> (referer: http://www.euroleague.net/main/results/by-date) 
2014-02-28 00:20:33+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2004> (referer: http://www.euroleague.net/main/results/by-date) 

After it loops through all the "Seasons" links, it starts on links that I don't see mentioned in any of my three steps:

http://www.euroleague.net/main/results/by-date?gamenumber=23&phasetypecode=TS++++++++&seasoncode=E2013 

And you can only find this kind of link structure if you loop through all the links in "Step 2" without returning to the "Step 1" starting point.
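For what it's worth (an observation from the crawl log, not part of the original question): in the URLs the site actually emits, `gamenumber` comes before `seasoncode` and `phasetypecode` is padded with `+` characters, so the Step 2 pattern never matches them:

```python
import re

# URL taken verbatim from the crawl log
url = ("http://www.euroleague.net/main/results/"
       "by-date?gamenumber=23&phasetypecode=TS++++++++&seasoncode=E2013")

step2_pattern = r'seasoncode\=E\d+\&gamenumber\=\d+\&phasetypecode\=\w+'

print(re.search(step2_pattern, url))  # None
```

A pattern that tolerates the real parameter order and the `+` padding (e.g. something like `r'gamenumber=\d+&phasetypecode=[\w+]+'`) might be closer to what these pages need, but that is a guess based only on the logged URLs.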

The question is: how do the rules work? Do they work step by step, the way I intend with this example, or does every rule have its own loop, going from rule to rule only after the first rule's loop is finished?

That's how I see it. Of course, it could also be that something is wrong with my rules/regex, which is quite possible.

Here is everything I get from the terminal:

scrapy crawl basketsp_test -o item6.xml -t xml 
2014-02-28 01:09:20+0200 [scrapy] INFO: Scrapy 0.20.0 started (bot: basketbase) 
2014-02-28 01:09:20+0200 [scrapy] DEBUG: Optional features available: ssl, http11, boto, django 
2014-02-28 01:09:20+0200 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'basketbase.spiders', 'FEED_FORMAT': 'xml', 'SPIDER_MODULES': ['basketbase.spiders'], 'FEED_URI': 'item6.xml', 'BOT_NAME': 'basketbase'} 
2014-02-28 01:09:21+0200 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState 
2014-02-28 01:09:21+0200 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2014-02-28 01:09:21+0200 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2014-02-28 01:09:21+0200 [scrapy] DEBUG: Enabled item pipelines: Basketpipeline3, Basketpipeline1db 
2014-02-28 01:09:21+0200 [basketsp_test] INFO: Spider opened 
2014-02-28 01:09:21+0200 [basketsp_test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2014-02-28 01:09:21+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023 
2014-02-28 01:09:21+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080 
2014-02-28 01:09:21+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date> (referer: None) 
2014-02-28 01:09:22+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date) 
2014-02-28 01:09:22+0200 [basketsp_test] DEBUG: Filtered duplicate request: <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2013> - no more duplicates will be shown (see DUPEFILTER_CLASS) 
2014-02-28 01:09:22+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2000> (referer: http://www.euroleague.net/main/results/by-date) 
2014-02-28 01:09:23+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2001> (referer: http://www.euroleague.net/main/results/by-date) 
2014-02-28 01:09:23+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2002> (referer: http://www.euroleague.net/main/results/by-date) 
2014-02-28 01:09:24+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2003> (referer: http://www.euroleague.net/main/results/by-date) 
2014-02-28 01:09:24+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2004> (referer: http://www.euroleague.net/main/results/by-date) 
2014-02-28 01:09:25+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2005> (referer: http://www.euroleague.net/main/results/by-date) 
2014-02-28 01:09:26+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2006> (referer: http://www.euroleague.net/main/results/by-date) 
2014-02-28 01:09:26+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2007> (referer: http://www.euroleague.net/main/results/by-date) 
2014-02-28 01:09:27+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2008> (referer: http://www.euroleague.net/main/results/by-date) 
2014-02-28 01:09:27+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2009> (referer: http://www.euroleague.net/main/results/by-date) 
2014-02-28 01:09:28+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2010> (referer: http://www.euroleague.net/main/results/by-date) 
2014-02-28 01:09:29+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2011> (referer: http://www.euroleague.net/main/results/by-date) 
2014-02-28 01:09:29+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2012> (referer: http://www.euroleague.net/main/results/by-date) 
2014-02-28 01:09:30+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=24&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013) 
2014-02-28 01:09:30+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=23&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013) 
2014-02-28 01:09:31+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=22&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013) 
2014-02-28 01:09:32+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=21&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013) 
2014-02-28 01:09:32+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=20&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013) 
2014-02-28 01:09:33+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=19&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013) 
2014-02-28 01:09:34+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=18&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013) 
2014-02-28 01:09:34+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=17&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013) 
2014-02-28 01:09:35+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=16&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013) 
2014-02-28 01:09:35+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=15&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013) 
2014-02-28 01:09:36+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=14&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013) 
2014-02-28 01:09:37+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=13&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013) 
2014-02-28 01:09:37+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=12&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013) 
2014-02-28 01:09:38+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=11&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013) 
2014-02-28 01:09:39+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=10&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013) 
2014-02-28 01:09:39+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=9&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013) 
2014-02-28 01:09:40+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=8&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013) 
2014-02-28 01:09:40+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=7&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013) 
2014-02-28 01:09:41+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=6&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013) 
2014-02-28 01:09:42+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=5&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013) 
2014-02-28 01:09:42+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=4&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013) 
2014-02-28 01:09:43+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=3&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013) 
2014-02-28 01:09:44+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=2&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013) 
2014-02-28 01:09:44+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=1&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013) 
2014-02-28 01:09:44+0200 [basketsp_test] INFO: Closing spider (finished) 
2014-02-28 01:09:44+0200 [basketsp_test] INFO: Dumping Scrapy stats: 
    {'downloader/request_bytes': 13663, 
    'downloader/request_count': 39, 
    'downloader/request_method_count/GET': 39, 
    'downloader/response_bytes': 527838, 
    'downloader/response_count': 39, 
    'downloader/response_status_count/200': 39, 
    'finish_reason': 'finished', 
    'finish_time': datetime.datetime(2014, 2, 27, 23, 9, 44, 569579), 
    'log_count/DEBUG': 46, 
    'log_count/INFO': 3, 
    'request_depth_max': 2, 
    'response_received_count': 39, 
    'scheduler/dequeued': 39, 
    'scheduler/dequeued/memory': 39, 
    'scheduler/enqueued': 39, 
    'scheduler/enqueued/memory': 39, 
    'start_time': datetime.datetime(2014, 2, 27, 23, 9, 21, 111255)} 
2014-02-28 01:09:44+0200 [basketsp_test] INFO: Spider closed (finished) 

And here is the rules part of the crawler:

class Basketspider(CrawlSpider): 
    name = "basketsp_test" 
    download_delay = 0.5 

    allowed_domains = ["www.euroleague.net"] 
    start_urls = ["http://www.euroleague.net/main/results/by-date"] 
    rules = (
     Rule(SgmlLinkExtractor(allow=('by-date\?seasoncode\=E\d+',)),follow=True), 
     Rule(SgmlLinkExtractor(allow=('seasoncode\=E\d+\&gamenumber\=\d+\&phasetypecode\=\w+',)),follow=True), 
     Rule(SgmlLinkExtractor(allow=('gamenumber\=\d+\&phasetypecode\=\w+\&gamecode\=\d+\&seasoncode\=E\d+',)),callback='parse_item'), 
    ) 

I suggest you read the scrapy source code. – kev

Answers

6

I would lean toward using a BaseSpider scraper instead of a crawler. With a BaseSpider you can have a much more predictable flow of requests, rather than finding every href on a page and visiting them according to global rules. Use yield Request() to keep iterating through the parent links and callbacks, passing the output item object all the way to the end.

From your description:

I think the crawler should work like this: the rules are something like a loop. When the first link is matched, the crawler will follow it to the "Step 2" page, then to "Step 3", and after that it will extract the data. After doing that, it will return to "Step 1" to match the second link, and start the loop again until there are no links left in the first step.

A request/callback stack like this will suit you well, since you know the order of the pages and which pages you need to scrape. It also has the added benefit of letting you collect information across multiple pages before returning the output item to be processed.

import re

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class Basketspider(BaseSpider):  # the original answer also mixed in its own errorLog class (not shown here)
    name = "basketsp_test"
    download_delay = 0.5

    def start_requests(self):
        item = WhateverYourOutputItemIs()  # placeholder for your Item class
        yield Request("http://www.euroleague.net/main/results/by-date",
                      callback=self.parseSeasonsLinks, meta={'item': item})

    def parseSeasonsLinks(self, response):
        item = response.meta['item']

        hxs = HtmlXPathSelector(response)
        html = hxs.extract()

        roundLinkPattern = re.compile(
            r'http://www\.euroleague\.net/main/results/by-date\?gamenumber=\d+&phasetypecode=RS')

        # Collect each distinct round link once
        roundLinkList = []
        for roundLink in re.findall(roundLinkPattern, html):
            if roundLink not in roundLinkList:
                roundLinkList.append(roundLink)

        for roundLink in roundLinkList:
            # if you want to output this info in the final item
            item['RoundLink'] = roundLink

            # Generate a new request for the round page
            yield Request(roundLink, callback=self.parseRoundPage, meta={'item': item})

    def parseRoundPage(self, response):
        item = response.meta['item']
        # Do whatever you need to do here; issue more requests if needed,
        # or return the item
        item['Thing'] = 'infoOnPage'
        # ....
        return item
13

You are right. According to the source code, before returning each response to the callback function, the crawler loops over the rules, starting from the first one. You should keep that in mind when you write your rules. For example, with the following rules:

rules = (
    Rule(SgmlLinkExtractor(allow=(r'/items',)), callback='parse_item', follow=True), 
    Rule(SgmlLinkExtractor(allow=(r'/items/electronics',)), callback='parse_electronic_item', follow=True), 
) 

the second rule will never be applied, because every link will already have been extracted by the first rule, with the parse_item callback. The second rule's matches will then be filtered out as duplicates by scrapy.dupefilter.RFPDupeFilter. You should use deny to get the matching right:

rules = (
    Rule(SgmlLinkExtractor(allow=(r'/items',)), deny=(r'/items/electronics',), callback='parse_item', follow=True), 
    Rule(SgmlLinkExtractor(allow=(r'/items/electronics',)), callback='parse_electronic_item', follow=True), 
) 
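To see the ordering effect concretely, here is a toy model of "the first rule whose allow matches (and whose deny does not) claims the link". This is a simplification for illustration, not Scrapy's actual code:

```python
import re

def first_matching_rule(url, rules):
    """Index of the first rule whose allow pattern matches the URL and
    whose deny pattern (if any) does not -- a toy model of rule order."""
    for i, (allow, deny) in enumerate(rules):
        if re.search(allow, url) and not (deny and re.search(deny, url)):
            return i
    return None

url = "/items/electronics/tv-123"

# Without deny: the first, broader rule always claims the link
print(first_matching_rule(url, [(r'/items', None),
                                (r'/items/electronics', None)]))   # 0

# With deny on the first rule: electronics links fall through to the second
print(first_matching_rule(url, [(r'/items', r'/items/electronics'),
                                (r'/items/electronics', None)]))   # 1
```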
3

If you are from China, I have a Chinese blog post about this:

別再濫用scrapy CrawlSpider中的follow=True


Let's see how the rules work under the hood:

def _requests_to_follow(self, response): 
    seen = set() 
    for n, rule in enumerate(self._rules): 
        links = [lnk for lnk in rule.link_extractor.extract_links(response) 
                 if lnk not in seen] 
        for link in links: 
            seen.add(link) 
            r = Request(url=link.url, callback=self._response_downloaded) 
            r.meta.update(rule=n, link_text=link.text)  # remember which rule produced the request
            yield r 

As you can see, when we follow a link, every rule extracts links from the response in a for loop, and each extracted link is added to a seen set.

Then every response is handled by self._response_downloaded:

def _response_downloaded(self, response): 
    rule = self._rules[response.meta['rule']] 
    return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow) 

def _parse_response(self, response, callback, cb_kwargs, follow=True): 

    if callback: 
        cb_res = callback(response, **cb_kwargs) or () 
        cb_res = self.process_results(response, cb_res) 
        for requests_or_item in iterate_spider_output(cb_res): 
            yield requests_or_item 

    # follow will go back to the rules again 
    if follow and self._follow_links: 
        for request_or_item in self._requests_to_follow(response): 
            yield request_or_item 

which goes back to self._requests_to_follow(response) again and again.
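Putting the two methods together, the whole mechanism behaves like a breadth-first loop in which every followed response is matched against all the rules again. A simplified, self-contained sketch (the mini-site dict and the rule tuples are made up for illustration and are not Scrapy's actual implementation):

```python
import re
from collections import deque

# Hypothetical mini-site: page URL -> links found on that page
pages = {
    "/start": ["/season/E2001", "/season/E2002"],
    "/season/E2001": ["/round/1", "/round/2"],
    "/season/E2002": ["/round/3"],
    "/round/1": ["/game/1"], "/round/2": ["/game/2"], "/round/3": ["/game/3"],
    "/game/1": [], "/game/2": [], "/game/3": [],
}

# (allow pattern, follow, has_callback) -- mirrors the Rule tuples
rules = [(r'/season/', True, False),
         (r'/round/', True, False),
         (r'/game/', False, True)]

def crawl(start):
    scraped, seen = [], set()
    queue = deque([(start, True)])            # (url, extract links from it?)
    while queue:
        url, do_extract = queue.popleft()
        if not do_extract:
            continue                          # downloaded, callback ran, not followed
        for link in pages[url]:
            if link in seen:
                continue                      # dupefilter: each link once
            for allow, follow, has_callback in rules:   # ALL rules, in order
                if re.search(allow, link):
                    seen.add(link)
                    if has_callback:
                        scraped.append(link)  # parse_item would run here
                    queue.append((link, follow))
                    break                     # first matching rule claims the link
    return scraped

print(crawl("/start"))  # ['/game/1', '/game/2', '/game/3']
```

Note how there is no "return to Step 1" phase: links from every response go into one shared queue, and each dequeued response is run through all the rules again.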

In summary: every followed response is run through all of the rules again, so link extraction and following form a recursive loop, not a step-by-step pipeline.