BeautifulSoup: exclude the content inside one or more tags

I have the following to find the text in a paragraph:

soup.find("td", { "id" : "overview-top" }).find("p", { "itemprop" : "description" }).text

How would I exclude all the text inside the <a> tags? Something like: in <p> but not in <a>?

2014-12-22 David542

Find and join all the text nodes in the p tag, checking that each node's parent is not an a tag:

p = soup.find("td", {"id": "overview-top"}).find("p", {"itemprop": "description"})

# Keep only the text nodes whose direct parent is not an <a> tag.
print(''.join(text for text in p.find_all(text=True)
              if text.parent.name != "a"))

Demo (note that "link text" is not printed):

>>> from bs4 import BeautifulSoup 
>>> 
>>> data = """ 
... <td id="overview-top"> 
...  <p itemprop="description"> 
...   text1 
...   <a href="google.com">link text</a> 
...   text2 
...  </p> 
... </td> 
... """ 
>>> soup = BeautifulSoup(data, "html.parser")
>>> p = soup.find("td", {"id": "overview-top"}).find("p", {"itemprop": "description"}) 
>>> print(p.text)

     text1 
     link text 
     text2 
>>> 
>>> print(''.join(text for text in p.find_all(text=True) if text.parent.name != "a"))

     text1 

     text2
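
The check above only looks at each text node's direct parent, so text inside markup nested in a link (for example <a>link <b>text</b></a>) would still come through. Here is a minimal sketch that walks all ancestors up to the starting element and accepts several tag names at once. The helper name text_excluding, its excluded parameter, and the extra <b> and <span> tags in the sample data are my own assumptions for illustration, not part of the original question.

```python
from bs4 import BeautifulSoup


def text_excluding(element, excluded=("a",)):
    """Join the text nodes under `element`, skipping those inside excluded tags.

    Hypothetical helper: `excluded` is any iterable of tag names.
    """
    excluded = set(excluded)
    parts = []
    for node in element.find_all(string=True):
        skip = False
        # Walk up the ancestors, stopping once we reach the starting element.
        for parent in node.parents:
            if parent is element:
                break
            if parent.name in excluded:
                skip = True
                break
        if not skip:
            parts.append(node)
    return "".join(parts)


# Sample data extended with nested markup (an assumption for this sketch).
data = """
<td id="overview-top">
 <p itemprop="description">
  text1
  <a href="google.com">link <b>text</b></a>
  <span>text2</span>
 </p>
</td>
"""

soup = BeautifulSoup(data, "html.parser")
p = soup.find("td", {"id": "overview-top"}).find("p", {"itemprop": "description"})

# Collapse whitespace so the result is easy to read.
print(" ".join(text_excluding(p, excluded=("a",)).split()))
print(" ".join(text_excluding(p, excluded=("a", "span")).split()))
```

Passing a tuple such as ("a", "span") covers the "one or more tags" case from the title. If you prefer the newer keyword, string=True is the current spelling of text=True in find_all.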

2014-12-22 20:55:42 alecxe

Using lxml:

import lxml.html as LH 

data = """ 
<td id="overview-top"> 
    <p itemprop="description"> 
     text1 
     <a href="google.com">link text</a> 
     text2 
    </p> 
</td> 
""" 

root = LH.fromstring(data) 
print(''.join(root.xpath(
    '//td[@id="overview-top"]//p[@itemprop="description"]/text()')))

yields

 text1 

     text2

To also get the text of the <p> tag's child tags, use a double slash, //text(), instead of a single forward slash:

print(''.join(root.xpath(
    '//td[@id="overview-top"]//p[@itemprop="description"]//text()')))

yields

 text1 
     link text 
     text2
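
The two XPath queries above are all-or-nothing: /text() drops the text of every child tag, while //text() keeps all of it, link included. If you want the text of other children but not of <a>, one option is an ancestor predicate. This is a sketch under my own assumptions: the sample data is wrapped in <table><tr> and extended with <b> and <span> tags that are not in the original question.

```python
import lxml.html as LH

# Sample data wrapped in a table and extended with nested markup
# (both are assumptions made for this sketch).
data = """
<table><tr>
<td id="overview-top">
    <p itemprop="description">
     text1
     <a href="google.com">link <b>text</b></a>
     <span>text2</span>
    </p>
</td>
</tr></table>
"""

root = LH.fromstring(data)

# Select every descendant text node of the <p>, except those that have
# an <a> element anywhere among their ancestors.
nodes = root.xpath(
    '//td[@id="overview-top"]//p[@itemprop="description"]'
    '//text()[not(ancestor::a)]')

# Collapse whitespace so the result is easy to read.
cleaned = " ".join("".join(nodes).split())
print(cleaned)
```

Note that ancestor::a looks all the way up to the document root, so if the <p> itself sat inside a link, every text node would be filtered out.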

2014-12-22 21:13:11 unutbu
