How to restrict a spider using Scrapy

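I am trying to scrape a website using a specific XPath. From the product page I want to scrape the product description, but how do I select only the product description?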

xPath : hxs.select('//div[@class="product-shop"]/p/text()').extract()

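The HTML is quite large, so please refer to the link given above.

I want to select only the product description, without any of the other details.

If I do this: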

[" ".join([i.strip() for i in hxs.select('//div[@class="product-shop"]/p/text()').extract()])] 

output : 
[u'Itemcode: 12BTS28271 Brand: BASICS InStock - Ships within 2 business days. Tip: 90% of our shipments reach within 4 business days! This product is part of the Basics T.shirts line made of 100% Cotton. Stripes Muscle Fit T.shirts that come in Green Color. Casual that comes with Henley away.']

But I only want:

[u'This product is part of the Basics T.shirts line made of 100% Cotton. Stripes Muscle Fit T.shirts that come in Green Color. Casual that comes with Henley away.']

Source

2013-06-25 vaibhav jain

Is there any regular expression or something similar to avoid the unwanted XPath matches?

Right-clicking the element in Chrome's Elements panel gives me this XPath:

//*[@id="product_addtocart_form"]/div[2]/div[1]/p[3]

which points to:

<p>This product is part of the Basics T.shirts line made of 100% Cotton.<br> 
         Stripes Muscle Fit T.shirts that come in Green Color.<br> 
         Casual that comes with Henley away.</p>

Trying the same XPath on this page also points to the description there:

<p>This product is part of the Basics Shirts line made of 100% Cotton.<br> 
        Plain Slim Fit Shirts that come in Orange Color.<br> 
        Casual that comes with Button Down away.</p>

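So it looks like all you need to do is run that XPath against the page and you're set. You should still verify that the XPath works in all cases, because a positional path like this can easily break when the page layout differs.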

Source

2013-06-25 14:05:42 TankorSmash

Thank you, I did not know that an XPath could also be written like 'div[2]'. Thanks.

@user2217267 Glad to help! – TankorSmash
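The `div[2]` notation from the comment above is a positional predicate: it selects the second `div` child, counting from 1 rather than 0. Below is a small illustration that uses only Python's standard-library `xml.etree.ElementTree` as a stand-in for Scrapy. `FRAGMENT` is a hypothetical, well-formed snippet (ElementTree needs valid XML, so `<br/>` is self-closed), and its structure is assumed rather than taken from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the root <form> mirrors the element that the
# answer's XPath addresses by id.
FRAGMENT = """
<form id="product_addtocart_form">
  <div>first div</div>
  <div>
    <div>
      <p>Itemcode: 12BTS28271</p>
      <p>Brand: BASICS</p>
      <p>This product is part of the Basics T.shirts line made of 100% Cotton.<br/>
         Casual that comes with Henley away.</p>
    </div>
  </div>
</form>
"""

root = ET.fromstring(FRAGMENT)

# Positions are 1-based: p[1] is the first <p>, p[3] the third.
first_p = root.find("./div[2]/div[1]/p[1]")
third_p = root.find("./div[2]/div[1]/p[3]")
missing_p = root.find("./div[2]/div[1]/p[4]")  # no fourth <p>, so None

# itertext() also yields the text that follows each <br/>.
third_text = " ".join(
    chunk.strip() for chunk in third_p.itertext() if chunk.strip()
)

print(first_p.text)
print(third_text)
print(missing_p)
```

ElementTree supports only a subset of XPath (for example, no `text()` step), so this shows the indexing rule only and is not a replacement for Scrapy's selectors.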
