2010-07-21

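Python parsing: using lxml to get just part of a tag's text

I'm processing HTML with Python that looks like the snippet below. I'm parsing with lxml, but could just as happily use pyquery: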

<p><span class="Title">Name</span>Dave Davies</p> 
<p><span class="Title">Address</span>123 Greyfriars Road, London</p> 

Pulling out 'Name' and 'Address' is dead easy whatever library I use, but how do I get the remainder of the text, i.e. 'Dave Davies'?

Answers

1

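Each lxml element can have a text attribute and a tail attribute. text is the text inside the element before its first child; tail is the text that directly follows the element's closing tag: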

import lxml.etree

content = '''\
<p><span class="Title">Name</span>Dave Davies</p>
<p><span class="Title">Address</span>123 Greyfriars Road, London</p>'''

# HTMLParser wraps the fragment in <html><body>, so search all descendants
root = lxml.etree.fromstring(content, parser=lxml.etree.HTMLParser())
for elt in root.findall('.//span'):
    # elt.text is the text inside <span>; elt.tail is the text right after </span>
    print(elt.text, elt.tail)

# Name Dave Davies
# Address 123 Greyfriars Road, London
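
As a small extension of the same idea, the label/value pairs can be collected into a dictionary. This is only a sketch: the helper name labelled_values is made up for illustration, and it assumes every label sits in a span with class "Title" whose value is the text immediately after it.

```python
import lxml.etree

content = '''\
<p><span class="Title">Name</span>Dave Davies</p>
<p><span class="Title">Address</span>123 Greyfriars Road, London</p>'''


def labelled_values(markup):
    """Map each Title span's text to the text that follows it (its tail).

    Hypothetical helper, not part of lxml.
    """
    root = lxml.etree.fromstring(markup, parser=lxml.etree.HTMLParser())
    result = {}
    for span in root.iter('span'):
        if span.get('class') == 'Title':
            # tail is None when nothing follows the closing </span>
            result[span.text] = (span.tail or '').strip()
    return result


fields = labelled_values(content)
print(fields)
```

Guarding against a None tail matters for markup where a span is the last thing inside its paragraph.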

Perfect, thanks! – AP257 2010-07-21 18:45:32

0

Take a look at BeautifulSoup. I've only just started using it, so I'm no expert. Off the top of my head:

# Written for BeautifulSoup 3 on Python 2 (print statement, findAll, nextSibling)
import BeautifulSoup

text = '''<p><span class="Title">Name</span>Dave Davies</p>
      <p><span class="Title">Address</span>123 Greyfriars Road, London</p>'''

soup = BeautifulSoup.BeautifulSoup(text)

paras = soup.findAll('p')

for para in paras:
    spantext = para.span.text           # text inside the <span>
    othertext = para.span.nextSibling   # text node right after </span>
    print spantext, othertext

[Out]: Name Dave Davies
     Address 123 Greyfriars Road, London
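
For anyone on Python 3, roughly the same approach can be sketched with the bs4 package (Beautiful Soup 4). This is an adaptation rather than the code from the answer above; the choice of the built-in 'html.parser' backend and the pairs variable are assumptions made for the example.

```python
from bs4 import BeautifulSoup, NavigableString

text = '''<p><span class="Title">Name</span>Dave Davies</p>
<p><span class="Title">Address</span>123 Greyfriars Road, London</p>'''

# 'html.parser' is the standard-library backend; other backends also work
soup = BeautifulSoup(text, 'html.parser')

pairs = []
for para in soup.find_all('p'):
    span = para.find('span', class_='Title')
    sibling = span.next_sibling
    # next_sibling may be None or another tag; only keep plain text nodes
    value = str(sibling).strip() if isinstance(sibling, NavigableString) else ''
    pairs.append((span.get_text(), value))

for label, value in pairs:
    print(label, value)
```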

Thanks for the help. I like BeautifulSoup too, but I believe it is no longer maintained, so I switched to lxml/pyquery. – AP257 2010-07-21 18:45:57

2

Another approach, using XPath:

>>> from lxml import html 
>>> doc = html.parse(file) 
>>> doc.xpath('//span[@class="Title"][text()="Name"]/../self::p/text()') 
['Dave Davies'] 
>>> doc.xpath('//span[@class="Title"][text()="Address"]/../self::p/text()') 
['123 Greyfriars Road, London']
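
The session above reads from a file; a self-contained variant that parses the markup from a string might look like the sketch below. The value_for helper and the following-sibling expression are a variation introduced here for illustration, not the expression from the answer, and the label is passed as an XPath variable so it does not have to be spliced into the query string.

```python
from lxml import html

markup = '''<p><span class="Title">Name</span>Dave Davies</p>
<p><span class="Title">Address</span>123 Greyfriars Road, London</p>'''

doc = html.fromstring(markup)


def value_for(label):
    """Return the first text node after the Title span with this label.

    Hypothetical helper; the result is a list, empty if the label is missing.
    """
    return doc.xpath(
        '//span[@class="Title"][text()=$label]/following-sibling::text()[1]',
        label=label,
    )


print(value_for('Name'))
print(value_for('Address'))
```

Selecting the text node that follows the span, rather than every text node of the parent p, keeps the result to a single value even if the paragraph contains other text.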