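There is definitely some kind of lxml parsing issue around the h1, because
//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]//p//text()
does include the text node you want, but
//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]//h1//text()
does not, even though the p you want is inside the h1 element in the page source.
The HTML source for this part of the page is: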
<div class="product-shop detail-right">
  <div class="prcdt-overview">
    <div class="title">
      <h1>
        <div class="htag">Vincent Chase</div>
        <p itemprop="name"> Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses</p>
      </h1>
      <span style="text-align:center;color:#329C92;font-size:12px;padding-top:5px">Product Id: 73871</span>
    </div>
    <div id="container2" style="display: none;">
      <div class="product-options" id="product-options-wrapper">
See this scrapy shell session:
$ scrapy shell http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html
2013-10-15 13:16:33+0200 [scrapy] INFO: Scrapy 0.18.2 started (bot: scrapybot)
2013-10-15 13:16:34+0200 [default] INFO: Spider opened
2013-10-15 13:16:35+0200 [default] DEBUG: Crawled (200) <GET http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html> (referer: None)
[s] Available Scrapy objects:
[s] hxs <HtmlXPathSelector xpath=None data=u'<html class="no-js"><!--<![endif]--><hea'>
[s] item {}
[s] request <GET http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html>
[s] response <200 http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html>
[s] settings <CrawlerSettings module=None>
[s] spider <BaseSpider 'default' at 0x354c310>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
Python 2.7.3 (default, Jan 2 2013, 13:56:14)
Type "copyright", "credits" or "license" for more information.
IPython 0.13.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: hxs.select('//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]//h1//text()').extract()
Out[1]:
[u'\n ',
u'Vincent Chase',
u'\n ']
In [2]: hxs.select('//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]//p//text()').extract()
Out[2]:
[u' Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses',
u'Enter the details below as they appear on your prescription from your doctor. ',
u'Understand Your Prescription.',
u'Retail Store Price - Rs 1600',
u'You Save - Rs 800',
u'Retail Store Price - Rs 4500',
u'You Save - Rs 1010',
u'STATUS: ',
u'READY TO SHIP\t',
u'(LIMITED STOCK)',
u' ',
u'Delivered By 20 Oct,2013']
In [4]: hxs.select('//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]/div[@class="title"]//div[@class="htag"]//text()').extract()
Out[4]: [u'Vincent Chase']
In [5]: hxs.select('//*[@id="product_addtocart_form"]//div[@class="prcdt-overview"]/div[@class="title"]//p//text()').extract()
Out[5]: [u' Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses']
In [6]:
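One way to check that the page source really nests the p inside the h1 (so that the missing text comes from how the HTML parser builds the tree, not from the XPath) is to parse a well-formed copy of the fragment with a strict XML parser. The sketch below uses only the Python standard library; the FRAGMENT string is trimmed by hand from the source shown above and is an assumption, not the full page.

```python
import xml.etree.ElementTree as ET

# Hand-trimmed, well-formed copy of the markup quoted above (an assumption:
# the real page is larger and is parsed by an HTML parser, not an XML one).
FRAGMENT = """
<div class="prcdt-overview">
  <div class="title">
    <h1>
      <div class="htag">Vincent Chase</div>
      <p itemprop="name"> Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses</p>
    </h1>
  </div>
</div>
"""

root = ET.fromstring(FRAGMENT)
h1 = root.find(".//h1")

# Collect every non-blank text node below the h1, like h1//text() would.
h1_texts = [t.strip() for t in h1.itertext() if t.strip()]
print(h1_texts)
```

With a strict parse, both the brand and the product name show up under the h1, unlike Out[1] above. A likely explanation, not verified here: block-level elements such as p are not allowed inside an h1 in HTML, so a lenient HTML parser may close the h1 before it reaches the p.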
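Suggestion:
This site/page uses the "itemscope" and "itemtype" attributes (see http://schema.org/docs/gs.html#microdata_itemscope_itemtype), so I suggest you use them to extract the data you need.
For example, you can use this XPath expression: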
//*[@itemscope and @itemtype="http://schema.org/Product"]
//*[@itemprop="name"]/text()
With HtmlXPathSelector, you can use
In [1]: ''.join(hxs.select('//*[@itemscope and @itemtype="http://schema.org/Product"]//*[@itemprop="name"]/text()').extract()).strip()
Out[1]: u'Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses'
Example scrapy shell session:
$ scrapy shell http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html
2013-10-15 12:47:30+0200 [scrapy] INFO: Scrapy 0.18.2 started (bot: scrapybot)
2013-10-15 12:47:31+0200 [default] INFO: Spider opened
2013-10-15 12:47:32+0200 [default] DEBUG: Crawled (200) <GET http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html> (referer: None)
[s] Available Scrapy objects:
[s] hxs <HtmlXPathSelector xpath=None data=u'<html class="no-js"><!--<![endif]--><hea'>
[s] item {}
[s] request <GET http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html>
[s] response <200 http://www.lenskart.com/vincent-chase-vc-5134-matt-black-grey-gradient-wayfarer-sunglasses.html>
[s] settings <CrawlerSettings module=None>
[s] spider <BaseSpider 'default' at 0x3f54310>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
Python 2.7.3 (default, Jan 2 2013, 13:56:14)
Type "copyright", "credits" or "license" for more information.
IPython 0.13.1 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: hxs.select("""
...: //*[@itemscope and @itemtype="http://schema.org/Product"]
...: //*[@itemprop="name"]/text()""")
Out[1]: [<HtmlXPathSelector xpath='\n//*[@itemscope and @itemtype="http://schema.org/Product"]\n //*[@itemprop="name"]/text()' data=u' Colorato VC 5134 Matt Black Grey Gradie'>]
In [2]: hxs.select("""
...: //*[@itemscope and @itemtype="http://schema.org/Product"]
...: //*[@itemprop="name"]/text()""").extract()
Out[2]: [u' Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses']
In [3]: ''.join(hxs.select("""
...: //*[@itemscope and @itemtype="http://schema.org/Product"]
...: //*[@itemprop="name"]/text()""").extract()).strip()
Out[3]: u'Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses'
In [4]:
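The same idea (select by microdata attributes, then join and strip the extracted text) can be tried outside Scrapy. This is only a minimal sketch with the Python standard library: SAMPLE is a hand-written, well-formed stand-in for the page, product_name is a hypothetical helper name, and ElementTree supports only a subset of XPath, so the path is a simplified form of the expression used above.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the page (an assumption, not the real markup).
SAMPLE = """
<body>
  <div itemscope="" itemtype="http://schema.org/Product">
    <h1>
      <div class="htag">Vincent Chase</div>
      <p itemprop="name"> Colorato VC 5134 Matt Black Grey Gradient Wayfarer Sunglasses</p>
    </h1>
  </div>
</body>
"""


def product_name(markup):
    """Return the stripped text of itemprop="name" inside a schema.org Product."""
    root = ET.fromstring(markup)
    # Simplified path: ElementTree has no "and" operator or text() step.
    path = ".//*[@itemtype='http://schema.org/Product']//*[@itemprop='name']"
    texts = [node.text or "" for node in root.findall(path)]
    # Same join-and-strip idiom as ''.join(...extract()).strip() in the session.
    return "".join(texts).strip()


name = product_name(SAMPLE)
print(name)
```

Selecting on itemprop rather than on the h1 keeps the expression independent of where the parser decides the h1 ends.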
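Thanks, it works great. Can you explain why it didn't work with my approach? – user2747776
@user2747776, I think it is some kind of parsing error. I've updated my answer with different tests in the scrapy shell. –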