Scraping a phpBB forum recursively with Scrapy

I want to use Scrapy to scrape a phpBB-based forum. My knowledge of Scrapy is very basic (but improving).

Extracting the content of the first page of a forum thread was more or less easy. My working spider is this:

import scrapy

from ptmya1.items import Ptmya1Item

class bastospider3(scrapy.Spider):
    name = "basto3"
    allowed_domains = ["portierramaryaire.com"]
    start_urls = [
        "http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a"
    ]

    def parse(self, response):
        for sel in response.xpath('//div[2]/div'):
            item = Ptmya1Item()
            item['author'] = sel.xpath('div/div[1]/p/strong/a/text()').extract()
            item['date'] = sel.xpath('div/div[1]/p/text()').extract()
            item['body'] = sel.xpath('div/div[1]/div/text()').extract()
            yield item

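However, when I try to crawl using the "next page" link, I fail after many frustrating hours. I would like to show you my attempts and ask for advice. Note: I would prefer a solution based on the SgmlLinkExtractor variants, since they are more flexible and powerful, but after so many attempts I prioritize success.

First, SgmlLinkExtractor with a restricted path. The "next page" XPath is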

/html/body/div[1]/div[2]/form[1]/fieldset/a

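In fact, I tested in the shell that

response.xpath('//div[2]/form[1]/fieldset/a/@href')[1].extract()

returns the correct value for the "next page" link. However, I want to point out that the same XPath yields two links: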

>>> response.xpath('//div[2]/form[1]/fieldset/a/@href').extract() 
[u'./search.php?sid=5aa2b92bec28a93c85956e83f2f62c08', u'./viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a&sid=5aa2b92bec28a93c85956e83f2f62c08&start=15']

So my failed spider is:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from ptmya1.items import Ptmya1Item

class bastospider3(scrapy.Spider):
    name = "basto7"
    allowed_domains = ["portierramaryaire.com"]
    start_urls = [
        "http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a"
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//div[2]/form[1]/fieldset/a/@href')[1],), callback="parse_items", follow=True)
    )

    def parse_item(self, response):
        for sel in response.xpath('//div[2]/div'):
            item = Ptmya1Item()
            item['author'] = sel.xpath('div/div[1]/p/strong/a/text()').extract()
            item['date'] = sel.xpath('div/div[1]/p/text()').extract()
            item['body'] = sel.xpath('div/div[1]/div/text()').extract()
            yield item

Second, SgmlLinkExtractor with allow. More primitive, and unsuccessful too:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from ptmya1.items import Ptmya1Item

class bastospider3(scrapy.Spider):
    name = "basto7"
    allowed_domains = ["portierramaryaire.com"]
    start_urls = [
        "http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a"
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a&start.',),), callback="parse_items", follow=True)
    )

    def parse_item(self, response):
        for sel in response.xpath('//div[2]/div'):
            item = Ptmya1Item()
            item['author'] = sel.xpath('div/div[1]/p/strong/a/text()').extract()
            item['date'] = sel.xpath('div/div[1]/p/text()').extract()
            item['body'] = sel.xpath('div/div[1]/div/text()').extract()
            yield item
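
Independently of which spider class is used, the allow pattern in this attempt looks unlikely to match the "next page" link. In a regular expression the "?" after "php" makes the preceding "p" optional instead of matching a literal question mark, and the real href carries a sid parameter between "sd=a" and "start". The following is a minimal sketch using only the standard library and the href printed in the shell session earlier; the relaxed pattern is a hypothetical alternative, not something taken from the thread.

```python
import re

# The "next page" href printed by the Scrapy shell earlier in the question.
next_href = ("./viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a"
             "&sid=5aa2b92bec28a93c85956e83f2f62c08&start=15")
search_href = "./search.php?sid=5aa2b92bec28a93c85956e83f2f62c08"

# Pattern exactly as passed to allow= in the spider above.
original = r'viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a&start.'

# Hypothetical alternative: escape the metacharacters and tolerate any
# extra parameters (such as sid) between the topic id and start.
relaxed = r'viewtopic\.php\?.*\bt=3821\b.*&start=\d+'

print(bool(re.search(original, next_href)))   # False
print(bool(re.search(relaxed, next_href)))    # True
print(bool(re.search(relaxed, search_href)))  # False
```

Checking a pattern against a real href like this, before wiring it into a Rule, makes it easier to tell a regex problem apart from a spider problem.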

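Finally, I went back to the damn Stone Age, or its equivalent, the first tutorial. I tried to use the loop included at the end of the beginner's tutorial. Another failure: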

import scrapy
import urlparse

from ptmya1.items import Ptmya1Item

class bastospider5(scrapy.Spider):
    name = "basto5"
    allowed_domains = ["portierramaryaire.com"]
    start_urls = [
        "http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a"
    ]

    def parse_articles_follow_next_page(self, response):
        item = Ptmya1Item()
        item['cacho'] = response.xpath('//div[2]/form[1]/fieldset/a/@href').extract()[1][1:] + "http://portierramaryaire.com/foro"
        for sel in response.xpath('//div[2]/div'):
            item['author'] = sel.xpath('div/div[1]/p/strong/a/text()').extract()
            item['date'] = sel.xpath('div/div[1]/p/text()').extract()
            item['body'] = sel.xpath('div/div[1]/div/text()').extract()
            yield item

        next_page = response.xpath('//fieldset/a[@class="right-box right"]')
        if next_page:
            cadenanext = response.xpath('//div[2]/form[1]/fieldset/a/@href').extract()[1][1:]
            url = urlparse.urljoin("http://portierramaryaire.com/foro", cadenanext)
            yield scrapy.Request(url, self.parse_articles_follow_next_page)

In all cases, all I get is a cryptic error message that gives me no hint toward a solution to my problem.

2015-10-08 21:24:46 [scrapy] DEBUG: Crawled (200) <GET http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a> (referer: None) 
2015-10-08 21:24:46 [scrapy] ERROR: Spider error processing <GET http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a> (referer: None) 
Traceback (most recent call last): 
    File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks 
    current.result = callback(current.result, *args, **kw) 
    File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 76, in parse 
    raise NotImplementedError 
NotImplementedError 
2015-10-08 21:24:46 [scrapy] INFO: Closing spider (finished)

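I would really appreciate any advice (or, better, a working solution) on this problem. I am completely stuck, and no matter how much I read, I cannot find a solution :(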

Source

2015-10-08 Juan Luis Chulilla

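You get the cryptic error message because you do not define a parse method. It is Scrapy's default entry point when it wants to parse a response.

However, you only defined parse_articles_follow_next_page or parse_item, and neither of them is a parse method.

And the failure is not caused by the next page but by the first one: Scrapy cannot parse the start_url, so your attempts are never reached in any case. Try renaming parse_items to parse and run your Stone Age approach again.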
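
Once the callback is reachable, the way the Stone Age attempt builds the next-page URL is worth a second look. It strips the leading "." from the href and joins the rest against a base without a trailing slash, which appears to drop the /foro segment. The sketch below uses only the standard library and the href shown in the question's shell output; the variable names are illustrative.

```python
# Python 3 import; in Python 2.7 urljoin lives in the urlparse module.
from urllib.parse import urljoin

page_url = ("http://portierramaryaire.com/foro/viewtopic.php"
            "?f=3&t=3821&st=0&sk=t&sd=a")
href = ("./viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a"
        "&sid=5aa2b92bec28a93c85956e83f2f62c08&start=15")

# As in the question: drop the leading "." and join against the forum root.
# href[1:] starts with "/", so it replaces the whole path of the base.
as_in_question = urljoin("http://portierramaryaire.com/foro", href[1:])

# Alternative: resolve the untouched href against the page it came from.
resolved = urljoin(page_url, href)

print(as_in_question)  # path is /viewtopic.php, without /foro
print(resolved)        # path is /foro/viewtopic.php
```

Inside a Scrapy callback, response.urljoin(href) performs the same resolution against the URL of the current response.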

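If you are using a Rule, you need a different spider class. For that, use CrawlSpider, which you can see in the tutorials. In that case, do not override the parse method; use parse_items as you do now. That is because CrawlSpider uses parse to forward the response to the callback method.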

Source

2015-10-09 06:32:36 GHajba

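Thank you for your answer. I tried your solution, changing 'parse_item' to 'parse' in both the definition and the callback of the first spiders, but the error message is the same. I have found different tutorials where 'parse' is changed to 'parse_whatever', and they apparently work. For example, http://mherman.org/blog/2012/11/08/recursively-scraping-web-pages-with-scrapy/#.VheBHvntlBc. I don't know whether the problem is related to the construction and/or type of the URL given to the callback...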

Updated the answer. When using rules you need a different spider type; the basic 'scrapy.Spider' does not work. – GHajba

Thanks again. The first spiders import CrawlSpider. Of course, I adapted them from other examples. The error message is still there :(

Thanks to GHajba, the problem is solved. The solution was worked out in the comments.

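However, the spider does not return the results in order. It starts at http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a

and it should go through the "next page" URLs, which all look like this: http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a&start=15

with the "start" variable increasing by 15 each time. In fact, the spider first returns the page produced by 'start=15', then 'start=30', then 'start=0', then 'start=15' again, then 'start=45'...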
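
Scrapy downloads pages concurrently, so responses are not guaranteed to arrive in page order. One possible way to cope is to record each page's start offset and sort after the crawl. The repeated 'start=15' may come from the sid parameter making the same page look like two different URLs; that is an assumption, not something confirmed in this thread. The sketch below uses only the standard library; page_offset and strip_sid are hypothetical helpers, and the list of URLs simply mimics the arrival order described above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def page_offset(url):
    """Return the integer value of the start parameter, or 0 if absent."""
    query = parse_qs(urlsplit(url).query)
    return int(query.get("start", ["0"])[0])


def strip_sid(url):
    """Return url without the sid session parameter."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    query.pop("sid", None)
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


base = "http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a"
# Illustrative arrival order: 15, 30, 0, 15 (assumed to carry a sid), 45.
arrived = [
    base + "&start=15",
    base + "&start=30",
    base,
    base + "&sid=5aa2b92bec28a93c85956e83f2f62c08&start=15",
    base + "&start=45",
]

# Drop the session id, deduplicate, then order by page offset.
unique = sorted({strip_sid(u) for u in arrived}, key=page_offset)
print([page_offset(u) for u in unique])  # [0, 15, 30, 45]
```

Inside a callback, page_offset(response.url) could be stored on each item so that the exported results can be sorted afterwards.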

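I am not sure whether I should create a new question or whether it would be better for future readers to develop the issue here. What do you think?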

Source

2015-10-09 16:16:46
