Unable to scrape text inside a p tag/element using Python Scrapy

I want to extract the product name from the page http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html
using the XPath
//*[@id="product_addtocart_form"]/div[7]/div/div[1]/h1/p. I am unable to scrape the text inside the p tag/element with Python Scrapy.

I tried the following, but I get nothing back:
item['pname'] = ' '.join(hxs.select('//*[@id="product_addtocart_form"]/div[7]/div/div[1]/h1/p/text()').extract()).strip()

Source

2013-10-15 user2747776

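There must be some kind of lxml parsing problem around the h1, because
//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]//p//text()
includes the text node you want, but
//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]//h1//text()
does not, even though the p you want is inside the h1 element in the source. The parsed tree apparently places the p outside the h1 (possibly because the parser closes the h1 when it reaches the p), so a path that steps through h1 to reach p matches nothing.

The relevant part of the page's HTML source is: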

<div class="product-shop detail-right">
    <div class="prcdt-overview">
        <div class="title">
            <h1>
                <div class="htag">Vincent Chase</div>
                <p itemprop="name"> Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses</p>
            </h1>
            <span style="text-align:center;color:#329C92;font-size:12px;padding-top:5px">Product Id: 73871</span>
        </div>

        <div id="container2" style="display: none;">
            <div class="product-options" id="product-options-wrapper">

See this scrapy shell session:

$ scrapy shell http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html
2013-10-15 13:16:33+0200 [scrapy] INFO: Scrapy 0.18.2 started (bot: scrapybot) 
2013-10-15 13:16:34+0200 [default] INFO: Spider opened 
2013-10-15 13:16:35+0200 [default] DEBUG: Crawled (200) <GET http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html> (referer: None) 
[s] Available Scrapy objects: 
[s] hxs  <HtmlXPathSelector xpath=None data=u'<html class="no-js"><!--<![endif]--><hea'> 
[s] item  {} 
[s] request <GET http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html> 
[s] response <200 http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html> 
[s] settings <CrawlerSettings module=None> 
[s] spider  <BaseSpider 'default' at 0x354c310> 
[s] Useful shortcuts: 
[s] shelp()   Shell help (print this help) 
[s] fetch(req_or_url) Fetch request (or URL) and update local objects 
[s] view(response) View response in a browser 
Python 2.7.3 (default, Jan 2 2013, 13:56:14) 
Type "copyright", "credits" or "license" for more information. 

IPython 0.13.1 -- An enhanced Interactive Python. 
?   -> Introduction and overview of IPython's features. 
%quickref -> Quick reference. 
help  -> Python's own help system. 
object? -> Details about 'object', use 'object??' for extra details. 

In [1]: hxs.select('//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]//h1//text()').extract() 
Out[1]: 
[u'\n       ', 
u'Vincent Chase', 
u'\n       '] 

In [2]: hxs.select('//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]//p//text()').extract() 
Out[2]: 
[u' Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses', 
u'Enter the details below as they appear on your prescription from your doctor. ', 
u'Understand Your Prescription.', 
u'Retail Store Price - Rs 1600', 
u'You Save - Rs 800', 
u'Retail Store Price - Rs 4500', 
u'You Save - Rs 1010', 
u'STATUS: ', 
u'READY TO SHIP\t', 
u'(LIMITED STOCK)', 
u' ', 
u'Delivered By 20 Oct,2013'] 

In [4]: hxs.select('//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]/div[@class="title"]//div[@class="htag"]//text()').extract() 
Out[4]: [u'Vincent Chase'] 

In [5]: hxs.select('//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]/div[@class="title"]//p//text()').extract() 
Out[5]: [u' Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses'] 

In [6]:
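The extract() results above are lists of raw text nodes, and some of them contain only whitespace. As a minimal sketch in plain Python (the helper name clean_text_nodes is hypothetical and not part of Scrapy), such a list could be normalized like this:

```python
def clean_text_nodes(nodes):
    """Join text nodes into one string.

    Whitespace inside each node is collapsed to single spaces, and
    nodes that contain only whitespace are dropped.
    """
    parts = [" ".join(node.split()) for node in nodes]
    return " ".join(part for part in parts if part)


# Text nodes copied from Out[1] of the session above.
h1_nodes = [u"\n       ", u"Vincent Chase", u"\n       "]
print(clean_text_nodes(h1_nodes))
```

Joining with a space rather than an empty string keeps separate nodes from running together when more than one of them carries text.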

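Suggestion:

This site/page uses the "itemscope" and "itemtype" attributes (see http://schema.org/docs/gs.html#microdata_itemscope_itemtype), so I suggest you use them to extract the data you need.

For example, you can use this XPath expression: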

//*[@itemscope and @itemtype="http://schema.org/Product"] 
    //*[@itemprop="name"]/text()

With HtmlXPathSelector, you can use:

In [1]: ''.join(hxs.select('//*[@itemscope and @itemtype="http://schema.org/Product"]//*[@itemprop="name"]/text()').extract()).strip() 
Out[1]: u'Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses'
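The same attribute-based lookup can be tried without Scrapy. The sketch below is only an illustration under stated assumptions. It uses the standard library's xml.etree.ElementTree, whose XPath subset has no "and" and no text(), so the two predicates are chained and the text is read with itertext(). The markup is a hand-written, well-formed fragment: the real page is HTML, and the element that carries itemscope is not shown in the answer, so its placement here is assumed. extract_product_name is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed stand-in for the product markup.
# Assumption: the itemscope/itemtype container wraps the title block.
SAMPLE = """
<body>
  <div itemscope="" itemtype="http://schema.org/Product">
    <div class="title">
      <h1>
        <div class="htag">Vincent Chase</div>
        <p itemprop="name"> Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses</p>
      </h1>
    </div>
  </div>
</body>
"""

# ElementTree equivalent of the XPath suggested above:
# chained predicates replace "and", and text is read in Python.
NAME_PATH = (
    ".//*[@itemscope][@itemtype='http://schema.org/Product']"
    "//*[@itemprop='name']"
)


def extract_product_name(markup):
    """Return the stripped text of the first itemprop="name" element, or None."""
    root = ET.fromstring(markup)
    node = root.find(NAME_PATH)
    if node is None:
        return None
    return "".join(node.itertext()).strip()


print(extract_product_name(SAMPLE))
```

Because the lookup keys on the microdata attributes rather than on positions such as div[7], it does not depend on where the parser ends up placing the p relative to the h1. In newer Scrapy versions, hxs.select(...) has been superseded by response.xpath(...), which should accept the original XPath expression as written.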

Example scrapy shell session:

$ scrapy shell http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html
2013-10-15 12:47:30+0200 [scrapy] INFO: Scrapy 0.18.2 started (bot: scrapybot) 
2013-10-15 12:47:31+0200 [default] INFO: Spider opened 
2013-10-15 12:47:32+0200 [default] DEBUG: Crawled (200) <GET http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html> (referer: None) 
[s] Available Scrapy objects: 
[s] hxs  <HtmlXPathSelector xpath=None data=u'<html class="no-js"><!--<![endif]--><hea'> 
[s] item  {} 
[s] request <GET http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html> 
[s] response <200 http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html> 
[s] settings <CrawlerSettings module=None> 
[s] spider  <BaseSpider 'default' at 0x3f54310> 
[s] Useful shortcuts: 
[s] shelp()   Shell help (print this help) 
[s] fetch(req_or_url) Fetch request (or URL) and update local objects 
[s] view(response) View response in a browser 
Python 2.7.3 (default, Jan 2 2013, 13:56:14) 
Type "copyright", "credits" or "license" for more information. 

IPython 0.13.1 -- An enhanced Interactive Python. 
?   -> Introduction and overview of IPython's features. 
%quickref -> Quick reference. 
help  -> Python's own help system. 
object? -> Details about 'object', use 'object??' for extra details. 

In [1]: hxs.select(""" 
    ...: //*[@itemscope and @itemtype="http://schema.org/Product"] 
    ...:  //*[@itemprop="name"]/text()""") 
Out[1]: [<HtmlXPathSelector xpath='\n//*[@itemscope and @itemtype="http://schema.org/Product"]\n //*[@itemprop="name"]/text()' data=u' Colorato VC 5134 Matt Black Grey Gradie'>] 

In [2]: hxs.select(""" 
//*[@itemscope and @itemtype="http://schema.org/Product"] 
    //*[@itemprop="name"]/text()""").extract() 
Out[2]: [u' Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses'] 

In [3]: ''.join(hxs.select(""" 
//*[@itemscope and @itemtype="http://schema.org/Product"] 
    //*[@itemprop="name"]/text()""").extract()).strip() 
Out[3]: u'Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses' 

In [4]:

Source

2013-10-15 10:43:43

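Thanks, it works great. Can you explain why it did not work with my approach? – user2747776

@user2747776, I think it is some kind of parsing error. I have updated my answer with different tests in the scrapy shell –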
