Scrapy scraper problem: I want to scrape www.paytm.com with Scrapy. The site uses AJAX requests, in the form of XHR, to display its search results.
I managed to track down the XHR, and the AJAX response is SIMILAR to JSON, but it is not actually JSON.
Here is a link to one of the XHR requests - https://search.paytm.com/search/?page_count=2&userQuery=tv&items_per_page=30&resolution=960x720&quality=high&q=tv&cat_tree=1&callback=angular.callbacks._6. If you look at the URL carefully, the parameter page_count is responsible for showing results from different pages, while the parameter userQuery carries the search query passed to the site.
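(For illustration only: the parameter names and values below are copied from the URL above, and the helper build_search_url is a name I made up, so treat this as a sketch of how page_count and userQuery could be varied per request, not as code from my spider.)

from urllib import urlencode  # Python 2.7, matching the traceback below

def build_search_url(query, page):
    # page_count picks the results page, userQuery/q carry the search term;
    # the remaining parameters are copied verbatim from the XHR above.
    params = {
        'userQuery': query,
        'q': query,
        'page_count': page,
        'items_per_page': 30,
        'resolution': '960x720',
        'quality': 'high',
        'cat_tree': 1,
        'callback': 'angular.callbacks._6',
    }
    return 'https://search.paytm.com/search/?' + urlencode(params)

Calling build_search_url('tv', 2) reproduces the URL above (up to parameter order).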
Now, if you look at the actual response, it is not really JSON; it only looks similar to JSON (I checked it on http://jsonlint.com/). I want to use Scrapy (Scrapy only because it is a framework, and building an equally fast scraper with other libraries like BeautifulSoup would take a lot of effort - that is the only reason I want to use Scrapy).
Now, here is the code snippet I use to extract the JSON response from the URL:
import json
# inside the spider's parse(self, response) method
jsonresponse = json.loads(response.body_as_unicode())
print json.dumps(jsonresponse, indent=4, sort_keys=True)
On executing this code, it throws an error stating:
2015-07-05 12:13:23 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-07-05 12:13:23 [scrapy] INFO: Optional features available: ssl, http11
2015-07-05 12:13:23 [scrapy] INFO: Overridden settings: {'DEPTH_PRIORITY': 1, 'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue', 'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue', 'CONCURRENT_REQUESTS': 100}
2015-07-05 12:13:23 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-05 12:13:23 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-05 12:13:23 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-05 12:13:23 [scrapy] INFO: Enabled item pipelines:
2015-07-05 12:13:23 [scrapy] INFO: Spider opened
2015-07-05 12:13:23 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-05 12:13:23 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-05 12:13:24 [scrapy] DEBUG: Crawled (200) <GET https://search.paytm.com/search/?page_count=2&userQuery=tv&items_per_page=30&resolution=960x720&quality=high&q=tv&cat_tree=1&callback=angular.callbacks._6> (referer: None)
2015-07-05 12:13:24 [scrapy] ERROR: Spider error processing <GET https://search.paytm.com/search/?page_count=2&userQuery=tv&items_per_page=30&resolution=960x720&quality=high&q=tv&cat_tree=1&callback=angular.callbacks._6> (referer: None)
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "Startup App/SCRAPERS/paytmscraper_scrapy/paytmspiderscript.py", line 111, in parse
jsonresponse = json.loads(response.body_as_unicode())
File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
2015-07-05 12:13:24 [scrapy] INFO: Closing spider (finished)
2015-07-05 12:13:24 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 343,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 6483,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 7, 5, 6, 43, 24, 733187),
'log_count/DEBUG': 2,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/ValueError': 1,
'start_time': datetime.datetime(2015, 7, 5, 6, 43, 23, 908135)}
2015-07-05 12:13:24 [scrapy] INFO: Spider closed (finished)
Now, my question: how do I scrape such a response using Scrapy? If any other code is needed, feel free to ask for it in the comments. I will gladly provide it!
Please provide the complete code related to this. It would be greatly appreciated! Maybe some code for manipulating the JSON response (from Python) (something like string manipulation) would also work for me, if it helps me scrape this!
P.S.: I cannot modify the JSON response manually (by hand) every time, because that is the response the website gives. So please suggest a programmatic (Pythonic) way to do this. Preferably, I would like to use Scrapy as my framework.
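(To make the string-manipulation idea concrete: the body is JSONP, i.e. JSON wrapped in a call to angular.callbacks._6(...). The sketch below simply strips that wrapper with a regex before calling json.loads; the regex and the method body are my own untested guess, not code from the spider.)

import json
import re

def parse(self, response):
    body = response.body_as_unicode()
    # The body looks like: angular.callbacks._6({...});
    # Strip the surrounding callback call so the remainder is plain JSON.
    match = re.match(r'^[^(]*\((.*)\)\s*;?\s*$', body, re.DOTALL)
    if match:
        body = match.group(1)
    jsonresponse = json.loads(body)
    print json.dumps(jsonresponse, indent=4, sort_keys=True)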
You don't have to extract the JSON manually. Remove '&callback=angular.callbacks._6' from the request – 3zzy
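(Following that suggestion, a minimal sketch of a spider; the spider name is made up, and it assumes that without the callback parameter the endpoint returns plain JSON that json.loads accepts.)

import json
import scrapy

class PaytmSearchSpider(scrapy.Spider):
    name = 'paytm_search'  # hypothetical name
    # The XHR URL from the question, with '&callback=angular.callbacks._6' removed.
    start_urls = [
        'https://search.paytm.com/search/?page_count=2&userQuery=tv'
        '&items_per_page=30&resolution=960x720&quality=high&q=tv&cat_tree=1'
    ]

    def parse(self, response):
        # Without the callback parameter the body should be plain JSON.
        jsonresponse = json.loads(response.body_as_unicode())
        self.logger.info(json.dumps(jsonresponse, indent=4, sort_keys=True))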