
I am trying hard to learn Scrapy, and while I have managed to crawl some sites, with others I have failed. For example, Scrapy doesn't get the products from this e-commerce site I am trying to crawl: http://www.polyhousestore.com/

I created a test spider that should get all the products on this page: http://www.polyhousestore.com/catalogsearch/result/?cat=&q=lc+60

When I run the spider, it doesn't find any products. Can someone help me understand what I am doing wrong? Is it related to CSS ::before and ::after? How can I make it work?

The spider code (which doesn't find the products in the page):

# -*- coding: utf-8 -*- 
import scrapy 
from scrapy.selector import Selector 


class PolySpider(scrapy.Spider): 
    name = "poly" 
    allowed_domains = ["polyhousestore.com"] 
    start_urls = (
        'http://www.polyhousestore.com/catalogsearch/result/?cat=&q=lc+60', 
    ) 

    def parse(self, response): 
        sel = Selector(response) 
        # Absolute path copied from the Chrome inspector
        products = sel.xpath('/html/body/div[4]/div/div[5]/div/div/div/div/div[2]/div[3]/div[2]/div') 
        if not products: 
            print '------------- No products from sel.xpath' 
        else: 
            print '------------- Found products ' + str(len(products)) 
I ran the spider from the command line, and this is the output:

D:\scrapyProj\cmdProj>scrapy crawl poly 
2016-01-19 10:23:16 [scrapy] INFO: Scrapy 1.0.3 started (bot: cmdProj) 
2016-01-19 10:23:16 [scrapy] INFO: Optional features available: ssl, http11 
2016-01-19 10:23:16 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'cmdProj.spiders', 'SPIDER_MODULES': ['cmdProj.spiders'], 'BOT_NAME': 'cmdProj'} 
2016-01-19 10:23:17 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState 
2016-01-19 10:23:17 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2016-01-19 10:23:17 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2016-01-19 10:23:17 [scrapy] INFO: Enabled item pipelines: 
2016-01-19 10:23:17 [scrapy] INFO: Spider opened 
2016-01-19 10:23:17 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-01-19 10:23:17 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-01-19 10:23:17 [scrapy] DEBUG: Crawled (200) <GET http://www.polyhousestore.com/catalogsearch/result/?cat=&q=lc+60> (referer: None) 
------------- No products from sel.xpath 
2016-01-19 10:23:18 [scrapy] INFO: Closing spider (finished) 
2016-01-19 10:23:18 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 254, 
 'downloader/request_count': 1, 
 'downloader/request_method_count/GET': 1, 
 'downloader/response_bytes': 16091, 
 'downloader/response_count': 1, 
 'downloader/response_status_count/200': 1, 
 'finish_reason': 'finished', 
 'finish_time': datetime.datetime(2016, 1, 19, 8, 23, 18, 53000), 
 'log_count/DEBUG': 2, 
 'log_count/INFO': 7, 
 'response_received_count': 1, 
 'scheduler/dequeued': 1, 
 'scheduler/dequeued/memory': 1, 
 'scheduler/enqueued': 1, 
 'scheduler/enqueued/memory': 1, 
 'start_time': datetime.datetime(2016, 1, 19, 8, 23, 17, 376000)} 
2016-01-19 10:23:18 [scrapy] INFO: Spider closed (finished) 

Thanks for your help.

Answer

When I look at the URL provided in the question in Chrome, I can see only 2 div tags directly under the body of that site. This means Scrapy sees only those div tags too. However, you want to access the 4th one, which does not exist, so your search returns no elements.

If I open a scrapy shell and execute a count of the div tags under the body, I get this in return:

[<Selector xpath='count(/html/body/div)' data=u'2.0'>] 

The expression above is the same as:

len(response.xpath('/html/body/div')) 
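
To see the failure directly, you can also query the non-existent fourth div in the same shell session (a sketch; the empty result follows from the count of 2 above):

$ scrapy shell 'http://www.polyhousestore.com/catalogsearch/result/?cat=&q=lc+60' 
>>> # Only two div tags sit directly under body, so index 4 matches nothing 
>>> response.xpath('/html/body/div[4]') 
[] 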

All this means that you have to modify your query to get the products. If you want the 4 product elements on the site, try:

response.xpath('//div[@class="item-inner"]') 

As you can see, you don't need to wrap the response in a Scrapy Selector.
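
For completeness, here is a minimal sketch of the corrected spider using that relative query (assuming the item-inner class still marks each product on the live page):

# -*- coding: utf-8 -*- 
import scrapy 


class PolySpider(scrapy.Spider): 
    name = "poly" 
    allowed_domains = ["polyhousestore.com"] 
    start_urls = (
        'http://www.polyhousestore.com/catalogsearch/result/?cat=&q=lc+60', 
    ) 

    def parse(self, response): 
        # Relative query: matches the product wrappers wherever they sit in 
        # the tree, and works on the response without a Selector wrapper 
        products = response.xpath('//div[@class="item-inner"]') 
        self.logger.info('Found %d products', len(products)) 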

Thank you for your answer – Ron

I tried to understand where I went wrong. I took the path from the Chrome inspector, so I don't understand why it didn't work for me. Also, did you get the items? I couldn't find them. – Ron

I just opened your site in the Chrome inspector, selected one of the items, and it is a div with the item-inner class. Of course there are layers of div tags around the item contents, so you can refine the XPath, but I only wanted to show where to search. – GHajba
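
One hedged illustration of that refinement: chain a relative query on each product wrapper in the shell. The product-name class below is only a guess at the markup inside item-inner and has to be verified in the inspector first:

>>> for product in response.xpath('//div[@class="item-inner"]'): 
...     # 'product-name' is hypothetical; check the real class in the inspector 
...     print product.xpath('.//*[@class="product-name"]//text()').extract() 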
