2013-06-24

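Scrapy: a small spider inside a spider?

I am trying to scrape product review information from epinions.com. When the main review text is too long, the page shows only an excerpt with a "Read more" link to another page. Take "http://www.epinions.com/reviews/samsung-galaxy-note-16-gb-cell-phone/pa_~1" as an example: if you look at the first review, you will see what I mean.

What I want to know: is it possible, in each iteration of the for loop, to have a small spider fetch that URL and scrape the full review from the linked page? I have the following code, but the "small spider" part does not work.

Here is my code: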

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from epinions_test.items import EpinionsTestItem
from scrapy.http import Response, HtmlResponse

class MySpider(BaseSpider):
    name = "epinions"
    allow_domains = ["epinions.com"]
    start_urls = ['http://www.epinions.com/reviews/samsung-galaxy-note-16-gb-cell-phone/pa_~1']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="review_info"]')

        items = []
        for sites in sites:
            item = EpinionsTestItem()
            item["title"] = sites.select('h2/a/text()').extract()
            item["star"] = sites.select('span/a/span/@title').extract()
            item["date"] = sites.select('span/span/span/@title').extract()
            item["review"] = sites.select('p/span/text()').extract()
            # Everything works fine up to here, and I get those four columns printed out nicely, until...

            url2 = sites.select('p/span/a/@href').extract()
            url = str("http://www.epinions.com%s" %str(url2)[3:-2])
            # This url is a string. When I print it, it looks like "http://www.epinions.com/review/samsung-galaxy-note-16-gb-cell-phone/content_624031731332", which looks legit.

            response2 = HtmlResponse(url)
            # I tried this in a scrapy shell, and it shows that this is an HtmlResponse...

            hxs2 = HtmlXPathSelector(response2)
            fullReview = hxs2.select('//div[@class = "user_review_full"]')
            item["url"] = fullReview.select('p/text()').extract()
            # The three lines above work in an independent spider, where start_urls is changed to the url just generated.
            # However, I get nothing from item["url"] in this code.

            items.append(item)
        return items

Why does item["url"] come back empty?

Thanks!

Answer

You should yield a new Request with a callback and pass your item along in the request's meta dictionary:

from scrapy.http import Request
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class EpinionsTestItem(Item):
    title = Field()
    star = Field()
    date = Field()
    review = Field()


class MySpider(BaseSpider):
    name = "epinions"
    # The attribute Scrapy reads is `allowed_domains`; `allow_domains` is silently ignored.
    allowed_domains = ["epinions.com"]
    start_urls = ['http://www.epinions.com/reviews/samsung-galaxy-note-16-gb-cell-phone/pa_~1']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="review_info"]')

        for site in sites:
            item = EpinionsTestItem()
            item["title"] = site.select('h2/a/text()').extract()
            item["star"] = site.select('span/a/span/@title').extract()
            item["date"] = site.select('span/span/span/@title').extract()

            links = site.select('p/span/a/@href').extract()
            if not links:
                # No "Read more" link: the listing page already holds the whole review.
                item["review"] = site.select('p/span/text()').extract()
                yield item
                continue

            # Ask Scrapy to download the full-review page. The partly filled
            # item travels with the request in `meta`.
            url = "http://www.epinions.com%s" % links[0]
            yield Request(url=url, callback=self.parse_url2, meta={'item': item})

    def parse_url2(self, response):
        # Called by Scrapy once the full-review page has been downloaded.
        hxs = HtmlXPathSelector(response)

        item = response.meta['item']
        fullReview = hxs.select('//div[@class = "user_review_full"]')
        item["review"] = fullReview.select('p/text()').extract()
        yield item
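
A side note on building the link: the `str(url)[3:-2]` slice in the question's code relies on how Python 2 prints a one-element list of unicode strings (`[u'...']`), so it is fragile. Below is a minimal sketch of a sturdier alternative that uses only the standard library. The helper name `absolute_link` is made up for illustration, and the sample href is the one quoted in the question.

```python
from urllib.parse import urljoin  # Python 3; on Python 2 this lives in the `urlparse` module


def absolute_link(page_url, hrefs):
    """Return the first href resolved against page_url, or None if there is none.

    `hrefs` stands in for the list that `.extract()` returns.
    """
    if not hrefs:
        return None
    return urljoin(page_url, hrefs[0])


page = "http://www.epinions.com/reviews/samsung-galaxy-note-16-gb-cell-phone/pa_~1"
hrefs = ["/review/samsung-galaxy-note-16-gb-cell-phone/content_624031731332"]

full_url = absolute_link(page, hrefs)
print(full_url)
```

Returning None for an empty list lets the caller decide what to do with reviews that have no "Read more" link, instead of requesting a meaningless URL.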

See also the documentation.

Hope that helps.

It helped... a lot. Thank you so much! I am reading the documentation on callbacks now, and hopefully I can figure the rest out too :D – pforyogurt