Scraping dynamic content using Scrapy

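I am trying to fetch the latest reviews from the Google Play store. To get them, I am following the answer to this question: "Scraping dynamic content using scrapy".

The method given in the answer to the linked question works fine in the scrapy shell, but when I try the same approach in my crawler, it is completely ignored.

Code snippet: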

import re 
import sys 
import time 
import urllib 
import urlparse 

from scrapy import Spider 
from scrapy.spider import BaseSpider 
from scrapy.http import Request, FormRequest 
from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor 

from play.items import PlayApp 

class PlaySpider(CrawlSpider): 
    name = "play" 
    allowed_domains = ["play.google.com"] 
    start_urls = [ 
      "https://play.google.com/store/apps" 
     ] 

    rules = (
     Rule(LxmlLinkExtractor(allow=('/store/apps$',)), callback='parseCategory',follow=True), 
    ) 

    def parseCategory(self, response): 
     """ 
      gets categories from the store home page and calls parseLinks for each category 
     """ 
     #something here...... 
     yield Request(categoryapps, callback=self.parseLinks) 

    def parseLinks(self, response): 

     ''' 
     get all the links from the category page and then 
     pass individual links to the parseApp function. 
     '''  
     #something here 

     yield Request(link, callback=self.parseApp) 

    def parseApp(self, response): 

     ''' 
     parses apps page to get info about the app 
     ''' 

     #application page parsing ......   

     frmdata = {"id": "com.supercell.boombeach", "reviewType": '0', "reviewSortOrder": '0', "pageNum":'0'} 
     url = "https://play.google.com/store/getreviews" 
     yield FormRequest(url, callback=self.parse_data, formdata=frmdata) 

     yield app 

    def parse_data(self, response): 
     # do stuff with data... 
     print '\n\n---------------I am here------------------\n\n'

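The parse_data function is never called. I asked about this on the #scrapy IRC channel and in several other places, but got no help. Please help me solve this.

This is the DEBUG output from the terminal: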

DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=isoft.studios.ncert.ncertbooks) 
2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=af.hindi.stories.booktwo) 
2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=com.frozenex.latestnewsms) 
2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=com.aqua.apps.english.hindi.dictionary) 
2015-06-03 13:56:07+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=com.merriamwebster) 
2015-06-03 13:56:08+0530 [play] DEBUG: Crawled (200) <POST https://play.google.com/store/getreviews> (referer: https://play.google.com/store/apps/details?id=an.HindiTranslate)

So the POST requests are indeed being sent, but the callback method is never called.

Source

2015-06-03 Amit Tripathi

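Does program control actually reach 'parseApp()'? – Jithin

Yes, the app data is obtained there and stored in MongoDB. –

You are missing the 'id' here – Jithin

It looks like you are not changing the id in the form data.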

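import json
import re
from time import sleep

from scrapy.http import FormRequest
from scrapy.selector import Selector

def parseApp(self, response):
    # Collect the unique app links found on the page.
    apps = set(response.xpath('//a[@class="card-click-target"]/@href').extract())
    url = "https://play.google.com/store/getreviews"
    for app in apps:
        # Take the package id from the query string. str.strip() would remove
        # a set of characters from both ends, not the literal prefix.
        _id = app.split('id=')[-1]
        form_data = {"id": _id, "reviewType": '0', "reviewSortOrder": '0', "pageNum": '0'}
        sleep(5)  # crude throttle; it blocks the crawler (DOWNLOAD_DELAY is the usual setting)
        yield FormRequest(url=url, formdata=form_data, callback=self.parse_data)

# The name must match the callback passed to FormRequest above.
def parse_data(self, response):
    response_data = re.findall(r"\[\[.*", response.body)
    if response_data:
        try:
            text = json.loads(response_data[0] + ']')
            sell = Selector(text=text[0][2])
        except (ValueError, IndexError, TypeError):
            return
        # do whatever you want to extract using sell.xpath('YOUR_XPATH_HERE')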
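
When deriving the `id` from an app link, keep in mind that `str.strip` removes a set of characters from both ends rather than a literal prefix, so parsing the query string is more robust. Below is a minimal sketch, assuming the links have the form `/store/apps/details?id=<package>`. The helper names `package_id` and `review_form_data` are hypothetical, and the code targets Python 3's `urllib.parse` (on Python 2 the same functions live in the `urlparse` module).

```python
from urllib.parse import urlparse, parse_qs


def package_id(href):
    """Return the value of the `id` query parameter, or None if absent."""
    query = parse_qs(urlparse(href).query)
    values = query.get("id")
    return values[0] if values else None


def review_form_data(href, page=0):
    """Build the form fields used in the question for one app link."""
    return {
        "id": package_id(href),
        "reviewType": "0",
        "reviewSortOrder": "0",
        "pageNum": str(page),
    }


href = "/store/apps/details?id=an.HindiTranslate"
print(package_id(href))

# str.strip treats its argument as a set of characters, so for this id it
# also eats letters from the package name and does not return the full id.
print(href.strip("/store/apps/details?id="))
```

The dictionary returned by `review_form_data` can be passed as `formdata` to `FormRequest`, so each app gets its own id instead of a fixed one.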

After cleaning, you will get something like this:

<div class="single-review"> 
    <a href="/store/people/details?id=106726831005267540508"> 
     <img class="author-image" alt="Lorence Gerona avatar image" src="https://lh3.googleusercontent.com/uFp_tsTJboUY7kue5XAsGA=w48-c-h48"> 
    </a> 
    <div class="review-header" data-expand-target="" data-reviewid="gp:AOqpTOHnsExa_P6JFRJD6HF5h71fpY91tNaEODjtfiTu-zPFki9ZnYsNp1HEcGFpGEfu9xqwJL_j-03Tx0e9lw"> 
     <div class="review-info"> 
      <span class="author-name"> 
       <a href="/store/people/details?id=106726831005267540508">Lorence Gerona</a> 
      </span> 
      <span class="review-date">3 June 2015</span> 
      <a class="reviews-permalink" href="/store/apps/details?id=com.supercell.boombeach&amp;reviewId=Z3A6QU9xcFRPSG5zRXhhX1A2SkZSSkQ2SEY1aDcxZnBZOTF0TmFFT0RqdGZpVHUtelBGa2k5Wm5Zc05wMUhFY0dGcEdFZnU5eHF3Skxfai0wM1R4MGU5bHc" title="Link to this review"></a> <div class="review-source" style="display:none"> 

     </div> 
     <div class="review-info-star-rating"> 
      <div class="tiny-star star-rating-non-editable-container" aria-label="Rated 5 stars out of five stars"> 
       <div class="current-rating" style="width: 100%;"> 

       </div> 
      </div> 
     </div> 
    </div> 
    <div class="rate-review-wrapper"> 
     <div class="play-button icon-button small rate-review" title="Spam" data-rating="SPAM"> 
      <div class="icon spam-flag"></div> 
     </div> 
     <div class="play-button icon-button small rate-review" title="Helpful" data-rating="HELPFUL"> 
      <div class="icon thumbs-up"></div> 
     </div> 
     <div class="play-button icon-button small rate-review" title="Unhelpful" data-rating="UNHELPFUL"> <div class="icon thumbs-down"></div> 
    </div> 
</div> 
</div> 
<div class="review-body"> 
<span class="review-title">Team BOOM BEACH</span> 
Amazing game I can defeat hammerman 
<div class="review-link" style="display:none"> 
    <a class="id-no-nav play-button tiny" href="#" target="_blank">Full Review</a> 
</div> 
</div> 
</div>
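
To illustrate the two steps (unwrapping the payload, then reading fields from the HTML), here is a rough sketch that runs without Scrapy. `SAMPLE_BODY` is a hypothetical response body, shaped only after what the answer's code implies: a line starting with `[[` whose closing `]` sits on a later line, with the review HTML at index `[0][2]`. The real endpoint's format is not confirmed here. The names `ReviewFields`, `extract_review_html` and `parse_review` are also hypothetical, and the parser only handles a single review. Inside a spider, the `Selector` used in the answer would be the more natural tool.

```python
import json
import re
from html.parser import HTMLParser

# Hypothetical body: an anti-JSON prefix, then the payload without its final "]".
SAMPLE_BODY = (
    ")]}'\n\n"
    '[["ecr",1,"<div class=\\"single-review\\">'
    '<span class=\\"author-name\\"><a href=\\"#\\">Lorence Gerona</a></span>'
    '<span class=\\"review-date\\">3 June 2015</span>'
    '<span class=\\"review-title\\">Team BOOM BEACH</span>'
    '</div>",2]\n]'
)


def extract_review_html(body):
    """Return the HTML fragment embedded in the body, or None if not found."""
    match = re.search(r"\[\[.*", body)
    if match is None:
        return None
    try:
        # Re-append the closing bracket that the regex leaves behind.
        return json.loads(match.group(0) + "]")[0][2]
    except (ValueError, IndexError, KeyError, TypeError):
        return None


class ReviewFields(HTMLParser):
    """Collect the text of <span> elements whose class is listed in WANTED."""

    WANTED = ("author-name", "review-date", "review-title")

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # class name of the span being read
        self._depth = 0       # spans nested inside the current one

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._current is not None:
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for name in self.WANTED:
            if name in classes:
                self._current = name
                self.fields[name] = ""
                break

    def handle_endtag(self, tag):
        if tag != "span" or self._current is None:
            return
        if self._depth:
            self._depth -= 1
        else:
            self.fields[self._current] = self.fields[self._current].strip()
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data


def parse_review(html):
    """Return a dict of the wanted fields found in one review fragment."""
    parser = ReviewFields()
    parser.feed(html)
    parser.close()
    return parser.fields


print(parse_review(extract_review_html(SAMPLE_BODY)))
```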

Source

2015-06-03 09:52:01 Jithin
