2013-03-25 45 views
3

Scrapy: extract HTML from a div without the wrapping parent tag

I use Scrapy to crawl a website.

I want to extract the contents of certain divs.

<div class="short-description"> 
{some mess with text, <br>, other html tags, etc} 
</div> 

loader.add_xpath('short_description', "//div[@class='short-description']/div") 

With this code I get what I need, but the result includes the wrapping HTML (<div class="short-description">...</div>).

How can I get rid of the parent HTML tag?

Note: selectors such as text() and node() do not help me, because my div contains <br>, <p>, other divs, etc., as well as whitespace, and I need to keep all of them.

Answers

2
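from scrapy.selector import HtmlXPathSelector  # legacy selector API used in this answer

hxs = HtmlXPathSelector(response)
# text() yields only the div's direct text nodes; child tags such as <br> are not included
for text in hxs.select("//div[@class='short-description']/text()").extract():
    print text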
2

Try node() in combination with Join():

loader.get_xpath('//div[@class="short-description"]/node()', Join()) 

The result looks like this:

>>> from scrapy.contrib.loader import XPathItemLoader 
>>> from scrapy.contrib.loader.processor import Join 
>>> from scrapy.http import HtmlResponse 
>>> 
>>> body = """ 
...  <html> 
...   <div class="short-description"> 
...    {some mess with text, <br>, other html tags, etc} 
...    <div> 
...     <p>{some mess with text, <br>, other html tags, etc}</p> 
...    </div> 
...    <p>{some mess with text, <br>, other html tags, etc}</p> 
...   </div> 
...  </html> 
... """ 
>>> response = HtmlResponse(url='http://example.com/', body=body) 
>>> 
>>> loader = XPathItemLoader(response=response) 
>>> 
>>> print loader.get_xpath('//div[@class="short-description"]/node()', Join()) 

      {some mess with text, <br> , other html tags, etc} 
      <div> 
       <p>{some mess with text, <br>, other html tags, etc}</p> 
      </div> 
      <p>{some mess with text, <br>, other html tags, etc}</p> 
>>> 
>>> loader.get_xpath('//div[@class="short-description"]/node()', Join()) 
u'\n   {some mess with text, <br> , other html tags, etc}\n 
    <div>\n   <p>{some mess with text, <br>, other html tags, etc}</p>\n 
    </div> \n  <p>{some mess with text, <br>, other html tags, etc}</p> \n'
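
If changing the XPath is not an option, another route is to post-process the string that was already extracted. The sketch below is not what the answer above does: it is a plain regex helper (the name `strip_outer_tag` is hypothetical) that assumes the fragment is the outer HTML of exactly one element and that no attribute value contains a `>` character.

```python
import re

# One outer element: opening tag, inner content, matching closing tag.
# Assumption: the fragment holds a single element; sibling elements of the
# same tag (e.g. "<div>a</div><div>b</div>") would be mis-handled.
_WRAPPER_RE = re.compile(
    r'^\s*<(?P<tag>[A-Za-z][\w-]*)\b[^>]*>(?P<inner>.*)</(?P=tag)\s*>\s*$',
    re.DOTALL,
)


def strip_outer_tag(fragment):
    """Return the inner HTML if `fragment` is wrapped in one element,
    otherwise return `fragment` unchanged."""
    match = _WRAPPER_RE.match(fragment)
    return match.group('inner') if match else fragment


outer = '<div class="short-description">\n text, <br> <div><p>x</p></div>\n</div>'
inner = strip_outer_tag(outer)
print(repr(inner))
```

Nested tags, <br> and whitespace inside the wrapper are left untouched. In an item loader, a function like this could probably be plugged in as an input processor (for example through MapCompose), though the node() approach is usually the more robust choice because it never parses HTML with a regex.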