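How to follow the next page in a Scrapy crawler to scrape content

I am able to scrape all the stories from the first page. My problem is how to move on to the next page and continue scraping the stories and names. Please check my code below.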

# -*- coding: utf-8 -*-
import scrapy
from cancerstories.items import CancerstoriesItem


class MyItem(scrapy.Item):
    name = scrapy.Field()
    story = scrapy.Field()


class MySpider(scrapy.Spider):

    name = 'cancerstories'
    allowed_domains = ['thebreastcancersite.greatergood.com']
    start_urls = ['http://thebreastcancersite.greatergood.com/clickToGive/bcs/stories/']

    def parse(self, response):

        rows = response.xpath('//a[contains(@href,"story")]')

        # loop over all links to stories
        for row in rows:
            myItem = MyItem()  # create a new item
            myItem['name'] = row.xpath('./text()').extract()  # assign name from link
            story_url = response.urljoin(row.xpath('./@href').extract()[0])  # extract url from link
            request = scrapy.Request(url=story_url, callback=self.parse_detail)  # create request for detail page with story
            request.meta['myItem'] = myItem  # pass the item with the request
            yield request

    def parse_detail(self, response):
        myItem = response.meta['myItem']  # extract the item (with the name) from the response
        # myItem['name'] = response.xpath('//h1[@class="headline"]/text()').extract()
        text_raw = response.xpath('//div[@class="photoStoryBox"]/div/p/text()').extract()  # extract the story (text)
        # clean up the text and assign to item (works on Python 2 and 3, unlike unicode.strip)
        myItem['story'] = ' '.join(text.strip() for text in text_raw)
        yield myItem  # return the item

Source

2016-02-10 leboMagma

You can change your scrapy.Spider into a CrawlSpider and use Rule and LinkExtractor to follow the link to the next page.

For this approach, you have to include the following code:

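...
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
...
class MySpider(CrawlSpider):
    ...
    rules = (
        # no callback: the rule only follows the pagination links
        # the literal "?" before "page=" is escaped so it is not read as a regex quantifier
        Rule(LinkExtractor(allow=r'\.\./stories;jsessionid=[0-9A-Z]+\?page=[0-9]+')),
    )
    ...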

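This way, for each page you visit, the spider creates a request for the next page (if there is one), follows it once the parse method has finished executing, and then repeats the process.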
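The allow argument of LinkExtractor is interpreted as a regular expression, so a literal "?" in a pagination URL has to be escaped. The following is a minimal sketch using only the standard re module; the sample string and its session ID are made up for illustration, and only the part of the pattern after "../" is checked.

```python
import re

# Hypothetical tail of a pagination link; the session ID is invented.
sample = "stories;jsessionid=0A1B2C3D?page=2"

# "+?" is read as a lazy quantifier, so no literal "?" is matched.
unescaped = re.compile(r"stories;jsessionid=[0-9A-Z]+?page=[0-9]+")

# "\?" matches the literal question mark before the query string.
escaped = re.compile(r"stories;jsessionid=[0-9A-Z]+\?page=[0-9]+")

print(bool(unescaped.search(sample)))  # False
print(bool(escaped.search(sample)))    # True
```

Checking a pattern this way against a few real hrefs copied from the page is usually quicker than rerunning the whole crawl.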

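Edit:

The rule I wrote only follows the next-page link; it does not extract the stories. If your first approach works, there is no need to change it.

Also, regarding the rule in your comment: SgmlLinkExtractor is deprecated, so I recommend using the default link extractor. In addition, the rule itself is not well defined.

When the attrs parameter of the extractor is not defined, it searches for links in the href attributes in the body, which in this case look like ../story/mother-of-4435 and not /clickToGive/bcs/story/mother-of-4435. That is why it does not find any link to follow.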
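To see how the two forms of the link relate, the relative href can be resolved against the start URL from the question with the standard library's urljoin. This is a minimal sketch; it assumes the listing page is served from the start URL with its trailing slash and that the page has no base tag.

```python
from urllib.parse import urljoin

# Start URL from the question and the relative href quoted in the answer.
base_url = "http://thebreastcancersite.greatergood.com/clickToGive/bcs/stories/"
relative_href = "../story/mother-of-4435"

# ".." climbs out of "stories/", so the result lands under /clickToGive/bcs/.
absolute_url = urljoin(base_url, relative_href)
print(absolute_url)
# http://thebreastcancersite.greatergood.com/clickToGive/bcs/story/mother-of-4435
```

A pattern written for one form will not match the other, so it helps to print a few extracted links and compare them with the regular expression before adding the rule.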

Source

2016-02-10 08:28:38 Javitronxo

Rule(SgmlLinkExtractor(allow=('/clickToGive\/bcs\/stories\?\page\=[0-9]+'),), callback="parseme", follow=True), adding this does not crawl the first page – leboMagma

I edited the answer to write a proper reply to your comment; I hope it helps – Javitronxo

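Yes, thanks, the link extractor can now follow the links, but it seems to keep scraping pages until the end and then follows the prev links back to the beginning – leboMagma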
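One possible contributor to this back-and-forth, which is an assumption and not confirmed anywhere in this thread, is that the same page is reachable under several URLs that differ only in the ;jsessionid= part, so they do not look like duplicates. A minimal sketch of normalising such links with the standard re module follows; the function name and sample links are hypothetical.

```python
import re

# Matches a ";jsessionid=..." path parameter; the ID alphabet is an assumption.
JSESSIONID_RE = re.compile(r";jsessionid=[0-9A-Za-z]+")


def strip_session_id(url):
    """Return url without its ;jsessionid=... part, leaving the query intact."""
    return JSESSIONID_RE.sub("", url)


# Hypothetical pagination links that differ only in the session ID.
first = strip_session_id("../stories;jsessionid=0A1B2C?page=2")
second = strip_session_id("../stories;jsessionid=9F8E7D?page=2")
print(first)            # ../stories?page=2
print(first == second)  # True
```

A function like this could be passed as the process_value argument of LinkExtractor, which lets each extracted value be rewritten before a request is built. Whether that stops the repeated crawling depends on how the site actually generates its links.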

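You can follow the next pages manually if you use the scrapy.Spider class, for example:

next_page = response.css('a.pageLink::attr(href)').extract_first()
if next_page:
    absolute_next_page_url = response.urljoin(next_page)
    yield scrapy.Request(url=absolute_next_page_url, callback=self.parse)

If you want to use the CrawlSpider class, don't forget to rename the parse method to parse_start_url.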

Source

2017-06-29 11:03:22 user2070338
