
Scrapy not entering the parse function

I am running the spider below, but it never enters the parse method, and I cannot figure out why. Can someone please help?

My code is given below:

    from scrapy.item import Item, Field
    from scrapy.selector import Selector
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector


    class MyItem(Item):
        reviewer_ranking = Field()
        print "asdadsa"


    class MySpider(BaseSpider):
        name = 'myspider'
        allowed_domains = ["amazon.com"]
        start_urls = ["http://www.amazon.com/gp/pdp/profile/A28XDLTGHPIWE1/ref=cm_cr_pr_pdp"]
        print "sadasds"

        def parse(self, response):
            print "fggfggftgtr"
            sel = Selector(response)
            hxs = HtmlXPathSelector(response)
            item = MyItem()
            item["reviewer_ranking"] = hxs.select('//span[@class="a-size-small a-color-secondary"]/text()').extract()
            return item

The output I get is as follows:

    $ scrapy runspider crawler_reviewers_data.py
    asdadsa
    sadasds
    /home/raj/Documents/IIM A/Daily sales rank/Daily reviews/Reviews_scripts/Scripts_review/Reviews/Reviewer/crawler_reviewers_data.py:18: ScrapyDeprecationWarning: crawler_reviewers_data.MySpider inherits from deprecated class scrapy.spider.BaseSpider, please inherit from scrapy.spider.Spider. (warning only on first subclass, there may be others)
      class MySpider(BaseSpider):
    2014-06-24 19:21:35+0530 [scrapy] INFO: Scrapy 0.22.2 started (bot: scrapybot)
    2014-06-24 19:21:35+0530 [scrapy] INFO: Optional features available: ssl, http11
    2014-06-24 19:21:35+0530 [scrapy] INFO: Overridden settings: {}
    2014-06-24 19:21:35+0530 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2014-06-24 19:21:35+0530 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2014-06-24 19:21:35+0530 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2014-06-24 19:21:35+0530 [scrapy] INFO: Enabled item pipelines:
    2014-06-24 19:21:35+0530 [myspider] INFO: Spider opened
    2014-06-24 19:21:35+0530 [myspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2014-06-24 19:21:35+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6027
    2014-06-24 19:21:35+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6084
    2014-06-24 19:21:36+0530 [myspider] DEBUG: Crawled (403) <GET http://www.amazon.com/gp/pdp/profile/A28XDLTGHPIWE1/ref=cm_cr_pr_pdp> (referer: None) ['partial']
    2014-06-24 19:21:36+0530 [myspider] INFO: Closing spider (finished)
    2014-06-24 19:21:36+0530 [myspider] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 259,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 28487,
         'downloader/response_count': 1,
         'downloader/response_status_count/403': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2014, 6, 24, 13, 51, 36, 631236),
         'log_count/DEBUG': 3,
         'log_count/INFO': 7,
         'response_received_count': 1,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2014, 6, 24, 13, 51, 35, 472849)}
    2014-06-24 19:21:36+0530 [myspider] INFO: Spider closed (finished)

Please help; I am stuck at this point.

Answer


This is an anti-crawling measure Amazon uses - you are getting a 403 Forbidden response because Amazon requires a User-Agent header to be sent with the request.

One option is to add it manually to the Request yielded from start_requests():

    from scrapy.http import Request

    class MySpider(BaseSpider):
        name = 'myspider'
        allowed_domains = ["amazon.com"]

        def start_requests(self):
            yield Request("https://www.amazon.com/gp/pdp/profile/A28XDLTGHPIWE1/ref=cm_cr_pr_pdp",
                          headers={'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"})

        ...

Another option is to set the DEFAULT_REQUEST_HEADERS setting project-wide.
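For reference, here is a minimal sketch of what that might look like in the project's settings.py (the values below are illustrative, not from the original answer). One caveat: the stock UserAgentMiddleware fills in the User-Agent header before DefaultHeadersMiddleware runs, so for the User-Agent itself the dedicated USER_AGENT setting is the more reliable project-wide knob:

    # settings.py - illustrative sketch

    # Merged into every outgoing request (via setdefault) by DefaultHeadersMiddleware.
    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en',
    }

    # UserAgentMiddleware sets the User-Agent header first, so override it
    # through this setting rather than through DEFAULT_REQUEST_HEADERS.
    USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
                  "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1")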

Also note that Amazon provides an API that has a python wrapper - consider using it.
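The answer does not say which wrapper it means; purely as an illustration, one widely used wrapper at the time was bottlenose, whose Product Advertising API calls look roughly like this (the credentials and ItemId below are placeholders, and note this API exposes product data rather than reviewer profile pages):

    import bottlenose  # assumed third-party wrapper: pip install bottlenose

    # Placeholder credentials - real calls need valid Product Advertising API keys.
    amazon = bottlenose.Amazon("AWS_ACCESS_KEY", "AWS_SECRET_KEY", "ASSOCIATE_TAG")

    # Returns the raw XML response as a string.
    xml = amazon.ItemLookup(ItemId="B00EXAMPLE", ResponseGroup="Reviews")
    print xml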

Hope that helps.


Thank you very much for the quick response. The manual approach of adding the header does not work - I get the same 403 error. Could you tell me how to set DEFAULT_REQUEST_HEADERS for a single spider? – Raj
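As an aside on the question in this comment: besides the project-wide setting, the stock UserAgentMiddleware also honours a user_agent attribute on the spider class, so a per-spider override can be as small as the following sketch (not from the original thread):

    class MySpider(BaseSpider):
        name = 'myspider'
        allowed_domains = ["amazon.com"]

        # Picked up by the built-in UserAgentMiddleware when the spider opens,
        # replacing the default Scrapy User-Agent for this spider only.
        user_agent = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
                      "(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1")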


@user2019135 did you remove the 'start_urls' attribute? I tested the code before posting - it works for me. – alecxe


@user2019135 this is [how the spider should look](https://gist.github.com/alecxe/46f95778072ce4b59e79). – alecxe