2015-01-26 42 views

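Extract paragraph text including other elements' content using Scrapy Selectors

Using Scrapy 0.24 Selectors, I want to extract a paragraph's content including the content of its child elements (in the example below, that is the text of the anchor <a>). How can I do this?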

Code

>>> from scrapy import Selector 
>>> html = """ 
     <html> 
      <head> 
       <title>Test</title> 
      </head> 
      <body> 
       <div> 
        <p>Hello, can I get this paragraph content without this <a href="http://google.com">Google link</a>?</p> 
       </div> 
      </body> 
     </html> 
     """ 
>>> sel = Selector(text=html, type="html") 
>>> sel.xpath('//p/text()').extract() 
[u'Hello, can I get this paragraph content without this ', u'?'] 

Output

[u'Hello, can I get this paragraph content without this ', u'?'] 

Expected output

[u'Hello, can I get this paragraph content without this Google link?'] 

Hmm. You could first extract ' 2015-01-26 23:25:44

Answer


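I would recommend BeautifulSoup. Scrapy is a complete crawling framework, whereas BeautifulSoup is a powerful parsing library (see "Difference between BeautifulSoup and Scrapy crawler?").

Documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

Installation: pip install beautifulsoup4

For your case: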

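# 'html' is the string you provided in the question
from bs4 import BeautifulSoup 

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser") 
# get_text() concatenates the text of each <p> and all of its descendants
res = [p.get_text().strip() for p in soup.find_all('p')] 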

Result:

[u'Hello, can I get this paragraph content without this Google link?']
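
As a side note on why the original selector returned two fragments: `//p/text()` selects only the text nodes that are direct children of `<p>`, and the link text belongs to `<a>`, not to `<p>`. The sketch below illustrates that distinction using only the standard library's `xml.etree.ElementTree`. This is an assumption made to keep the example free of dependencies. It is not Scrapy's API, and it only works on well-formed markup, so the `fragment` string is a cleaned-up copy of the paragraph from the question.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the paragraph from the question (illustrative only).
fragment = (
    '<div><p>Hello, can I get this paragraph content without this '
    '<a href="http://google.com">Google link</a>?</p></div>'
)

root = ET.fromstring(fragment)
p = root.find("p")

# Direct text of <p> only: its leading text plus the tail text after each child.
# This mirrors what the XPath //p/text() selects.
direct_text = [p.text] + [child.tail for child in p]

# Text of <p> and all of its descendants, in document order, as one string.
# This mirrors joining the results of //p//text().
full_text = "".join(p.itertext())

print(direct_text)
print(full_text)
```

With the Scrapy selector from the question, the same idea can be expressed in XPath itself, for example `''.join(sel.xpath('//p//text()').extract())`, or `sel.xpath('string(//p)').extract()`, which returns the string value of the first matching `<p>`.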