2017-02-13 16 views
-1

Hi, I'm on the C10 plan again and trying to scrape the Amazon website; crawled (200) but not scraped - Crawlera

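I have a problem where the log sometimes says a page was crawled, but the spider neither scrapes the data I want nor follows the next page as I instructed. It scrapes some pages and not others, and I don't understand why. When I compare my code with the HTML at those URLs, there are still items on the crawled pages that should have been scraped but weren't. Can anyone help me understand what's going on? I suspect the site may be returning a captcha, but even so, I thought Crawlera automatically retries requests that get a captcha.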

Here is the log:

'time': '2017-02-12', 
'title': u'Basic GIS Coordinates, Second Edition', 
'url': u'https://www.amazon.com/Basic-GIS-Coordinates-Second-Sickle/dp/1420092316/ref=sr_1_64?s=tradein-aps&srs=9187220011&ie=UTF8&qid=1486932384&sr=1-64'} 
2017-02-12 14:46:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/s//s/ref=sr_nr_n_3/153-6246827-9833634?srs=9187220011&fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2Cn%3A173507%2Cn%3A173515%2Cn%3A227541%2Cn%3A13735&bbn=227541&ie=UTF8&qid=1486860051&rnid=227541> (referer: None) 
2017-02-12 14:46:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/s//s/ref=sr_nr_n_2/153-6246827-9833634?srs=9187220011&fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2Cn%3A173507%2Cn%3A173515%2Cn%3A227541%2Cn%3A52187011&bbn=227541&ie=UTF8&qid=1486860051&rnid=227541> (referer: None) 
2017-02-12 14:46:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/s/ref=sr_pg_2/153-6246827-9833634?bbn=227541&fst=as%3Aoff&ie=UTF8&page=2&qid=1486932385&rh=n%3A283155%2Cn%3A%211000%2Cn%3A173507%2Cn%3A173515%2Cn%3A227541%2Cn%3A13735&srs=9187220011> (referer: https://www.amazon.com/s//s/ref=sr_nr_n_3/153-6246827-9833634?srs=9187220011&fst=as%3Aoff&rh=n%3A283155%2Cn%3A%211000%2Cn%3A173507%2Cn%3A173515%2Cn%3A227541%2Cn%3A13735&bbn=227541&ie=UTF8&qid=1486860051&rnid=227541) 
2017-02-12 14:46:44 [scrapy.log] DEBUG: successfully added! 
2017-02-12 14:46:44 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/s/ref=sr_pg_2/153-6246827-9833634?bbn=227541&fst=as%3Aoff&ie=UTF8&page=2&qid=1486932385&rh=n%3A283155%2Cn%3A%211000%2Cn%3A173507%2Cn%3A173515%2Cn%3A227541%2Cn%3A13735&srs=9187220011> 
{'currency': u'$', 
+1

Since you have a Crawlera plan, I'd suggest asking for help directly on [their support page](https://support.scrapinghub.com) – eLRuLL

Answers

0

Since you're crawling Amazon, my guess is that you're getting a "captcha" page instead of the regular product page.

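Maybe you should print the content of your responses instead of only returning items; then you can tell which pages were actually scraped in full.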
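As a rough illustration of that check, here is a minimal sketch that labels each crawled page as a captcha page, an empty page, or a good one. The marker strings and both function names are assumptions made for this example, not anything confirmed by Amazon or Crawlera; inspect a real blocked response and adjust them.

```python
# Hypothetical marker strings: adjust them after inspecting a real blocked page.
CAPTCHA_MARKERS = (
    "validatecaptcha",
    "type the characters you see in this image",
    "robot check",
)


def looks_like_captcha(html):
    """Return True if the page text contains any of the assumed captcha markers."""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)


def classify_page(url, html, item_count):
    """Return a one-line diagnosis for a crawled page.

    item_count is the number of items your own selectors extracted from the page.
    """
    if looks_like_captcha(html):
        return "CAPTCHA %s" % url
    if item_count == 0:
        return "EMPTY %s (200 but selectors matched nothing)" % url
    return "OK %s (%d items)" % (url, item_count)
```

Inside a Scrapy callback you could log the result with something like `self.logger.warning(classify_page(response.url, response.text, len(items)))`, where `items` is whatever list your callback built. A page that comes back as `EMPTY` rather than `CAPTCHA` would point at the selectors or the page layout rather than at blocking.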

+0

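Yes, I have another post about the same problem, and someone suggested it might be a captcha issue, so I tried Crawlera because they handle that, but I still get the same behavior. Thanks for the suggestion; I'll go ahead and print the content as you suggested to see what's happening. What do all the pages have in common, though? What should I try printing? –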

+0

Try response.body or similar, if it's not – Hosni

+0

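Doesn't response.body return the entire HTML? In that case I'd be printing a lot and wouldn't be able to see much; it would be too cluttered. What do you mean by human readable? How can I do that? –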
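One way to keep the output readable, sketched here under assumptions, is to log a short summary of each response instead of the whole body. The helper name `summarize_body` is hypothetical, and the regular expressions are a rough heuristic for stripping tags, not a real HTML parser.

```python
import re


def summarize_body(body, max_chars=200):
    """Return a small dict describing an HTML body given as bytes."""
    text = body.decode("utf-8", errors="replace")
    match = re.search(r"<title[^>]*>(.*?)</title>", text, re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip() if match else None
    # Drop script/style blocks, then all remaining tags, then collapse whitespace.
    visible = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", text)
    visible = re.sub(r"(?s)<[^>]+>", " ", visible)
    visible = " ".join(visible.split())
    return {"bytes": len(body), "title": title, "snippet": visible[:max_chars]}
```

In a Scrapy callback this could be called as `summarize_body(response.body)`. The page title and the first couple of hundred characters of visible text are usually enough to tell a search results page from a captcha or error page, and an unusually small `bytes` value is another hint that the page is not what you expected.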