2012-12-08 55 views

How to parse HTML embedded in XML feed items with Scrapy's XMLFeedSpider

I am trying to parse an XML feed with XMLFeedSpider. From the feed I want to extract the "price" element:

<span class="price" id="product-price-2037">19,77 €</span> 

But this price sits inside HTML code within the item's description tag, as follows:

<channel> 
<item> 
<title> 
<![CDATA[ product title ]]> 
</title> 
<meta http-equiv="X-UA-Compatible" content="IE=8"/> 
<link>http://example.com/apage.html</link> 
<description> 
<![CDATA[ 
<table><tr><td><a href="http://example.com/apage.html"> 
<img src="http://example.com/media/catalog/product/aimage173.jpg" border="0" align="left" height="75" width="75"></a></td> 
<td style="text-decoration:none;"> <div class="price-bframe"> <p class="old-price"> <span class="price-label">Prix normal :</span> 
<span class="price" id="old-price-2895037">40,00 €</span> </p> 
<p class="special-price"> <span class="price-label">Prix spécial :</span> 
<span class="price" id="product-price-2037">19,77 €</span> </p> </div> </td></tr></table> 
]]> 
</description> 
</item> 
</channel> 

Here is my current spider:

from scrapy.contrib.spiders import XMLFeedSpider 
from scrapy.selector import XmlXPathSelector 
from tutorial.items import DmozItem 

class DmozSpider(XMLFeedSpider):
    name = 'myspidername'
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/rss/catalog/new/store_id/1/']
    iterator = 'iternodes'
    itertag = 'channel'

    def parse_node(self, response, node):
        title = node.select('item/title/text()').extract()
        link = node.select('item/link/text()').extract()
        price = node.select('*[@class=price"]text()').extract()
        item = DmozItem()
        item['title'] = title
        item['link'] = link
        item['price'] = price
        return item

Result:

Invalid Xpath: *[@class=price"]text() 

Given that the URL above is invalid, how are you testing this? – Talvalin

Answer


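I think this is because your XPath is invalid. Try this:

*[@class="price"]/text()

I think you missed the opening quote around price and the slash before text().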
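
Even with a valid XPath, the price span is probably not reachable as an XML node of the feed: it sits inside the CDATA text of <description>, so that text likely needs to be parsed as HTML in a second step. Below is a minimal sketch of that two-step idea using only the Python standard library rather than Scrapy itself. FEED is a trimmed copy of the sample above, and PriceParser and extract_prices are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Trimmed copy of the feed from the question (assumption: the real feed has the same shape).
FEED = """<channel>
<item>
<title><![CDATA[ product title ]]></title>
<link>http://example.com/apage.html</link>
<description><![CDATA[
<table><tr><td><a href="http://example.com/apage.html">
<img src="http://example.com/media/catalog/product/aimage173.jpg" border="0"></a></td>
<td> <div class="price-bframe"> <p class="old-price">
<span class="price" id="old-price-2895037">40,00 €</span> </p>
<p class="special-price">
<span class="price" id="product-price-2037">19,77 €</span> </p> </div> </td></tr></table>
]]></description>
</item>
</channel>"""


class PriceParser(HTMLParser):
    """Collects the text of <span class="price"> tags whose id starts with id_prefix."""

    def __init__(self, id_prefix):
        super().__init__()
        self.id_prefix = id_prefix
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (tag == "span" and attrs.get("class") == "price"
                and (attrs.get("id") or "").startswith(self.id_prefix)):
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price and data.strip():
            self.prices.append(data.strip())


def extract_prices(feed_xml, id_prefix="product-price-"):
    """Step 1: read each <item> as XML. Step 2: parse its description text as HTML."""
    results = []
    for item in ET.fromstring(feed_xml).iter("item"):
        # The CDATA section arrives here as one plain string of HTML.
        description = item.findtext("description") or ""
        parser = PriceParser(id_prefix)
        parser.feed(description)
        parser.close()
        results.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "price": parser.prices,
        })
    return results


if __name__ == "__main__":
    print(extract_prices(FEED))
```

Inside a Scrapy spider the same approach would presumably mean setting itertag = 'item', extracting the description text in parse_node, and building a second selector from that string (for example Selector(text=description) in current Scrapy versions) before running //span[@class="price"]/text() on it. The id prefix filter is one way to pick the special price over the old price; matching on the parent p class would be another.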