Scrapy: iterating over selectors yields n duplicate items, where n is the number of selectors found on the page

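I have a working scraper that collects information from a review site. The problem is that when I scrape a business page with several reviews and try to yield items, I get only the first item n times (where n is the number of reviews the selector found).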

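I have read a lot about generators, and I am sure this happens because I am not thinking about the problem correctly. Here is a simplified snippet. Note that my real spider is more complex and uses callbacks and so on, but this code reproduces the behavior I am describing.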

from scrapy import Spider
from scrapy.selector import Selector
from yelp.items import ReviewItem

class CategorySpider(Spider):
    name = "yelp_search_"
    allowed_domains = ["yelp.com"]

    start_urls = ["http://www.yelp.com/biz/j-crew-arden"]

    def parse(self, response):
        sel = Selector(response)

        # There are 9 particular reviews on this page
        reviews_info = sel.xpath('//div[contains(@class, "review review--with-sidebar") and @itemprop="review"]')
        for reviewSelector in reviews_info:
            # If I print the extracted review selector here, I can confirm that only the first review selector is being used
            # In other words, I expect extract_first to extract the one and only result within the reviewSelector
            # Note: if I just do extract(), the item property is populated with a list of all 9 reviewSelectors
            # i.e. a list of 9 usernames given to me 9 times
            reviewitem = ReviewItem()
            reviewitem["username"] = reviewSelector.xpath('//*[@itemprop="author"]/@content').extract_first()
            reviewitem["userprofileurl"] = reviewSelector.xpath('//*[@class="user-display-name"]/@href').extract_first()
            reviewitem["userlocation"] = reviewSelector.xpath('//*[contains(@class, "user-location responsive-hidden-small")]/text()').extract_first().strip()
            reviewitem["reviewtext"] = reviewSelector.xpath('//*[@itemprop="description"]/@content').extract_first()
            reviewitem["reviewrating"] = reviewSelector.xpath('//*[@itemprop="ratingValue"]/@content').extract_first()
            reviewitem["reviewdate"] = reviewSelector.xpath('//*[@itemprop="datePublished"]/@content').extract_first()
            reviewitem["reviewvotesuseful"] = reviewSelector.xpath('//a[@rel="useful"]/span[@class="count"]/text()').extract_first()
            yield reviewitem

This particular code gives me 9 scraped results, but they are all from the first reviewSelector.

What am I doing wrong here?

Asked 2016-07-29 by matisetorm

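Once you have your "sub-selector" reviewSelector, you need to put a . in front of the XPath so that it is evaluated relative to that sub-selector. An XPath that starts with // searches the entire document, no matter which selector it is called on.

That is, this: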

reviewSelector.xpath('//*[@itemprop="author"]/@content').extract_first()

should be:

reviewSelector.xpath('.//*[@itemprop="author"]/@content').extract_first()

Answered 2016-07-29 12:46:38 by Granitosaurus

That was it! I can't believe it was that simple. Thank you very much. – matisetorm

@matisetorm No problem, this is probably the most common XPath question there is. – Granitosaurus
