How do I extract HTML with beautifulsoup4?

The HTML is as follows:

<td class='Thistd'><a ><img /></a>Here is some text.</td>

I only want the string inside the <td>. I don't need the <a>...</a>. How can I do that?

My code:

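from bs4 import BeautifulSoup

html = """<td class='Thistd'><a><img /></a>Here is some text.</td>"""

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all('td', {'class': 'Thistd'})
for td in tds:
    print(td)  # prints the whole <td> element, markup included
    print('=============')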

What I get is <td class='Thistd'><a ><img /></a>Here is some text.</td>

But I only need Here is some text.

Source

2015-10-14 jianbing Ma

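What is the difference between what you got and what you want? – The6thSense

Sorry, there was no difference; there were some mistakes, and they are fixed now. –

Use td.getText() to get the plain text from your element.

For example: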

for td in tds:
    print(td.getText())  # text content only, without any tags
    print('=============')

Output:

Here is some text. 
=============
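A small self-contained sketch of the same idea in Python 3, in case it helps. get_text() is the bs4 spelling of getText(), and strip=True trims surrounding whitespace. The extra spaces in the sample HTML are my own assumption, added only to make the effect of strip=True visible.

```python
from bs4 import BeautifulSoup

# Sample markup; the padding spaces around the text are an assumption
html = "<td class='Thistd'><a><img /></a>  Here is some text.  </td>"
soup = BeautifulSoup(html, "html.parser")
td = soup.find("td", class_="Thistd")

raw_text = td.get_text()              # text exactly as it appears in the markup
clean_text = td.get_text(strip=True)  # same text with surrounding whitespace removed

print(repr(raw_text))
print(repr(clean_text))
```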

Edit:

You can remove the <a> element and then print what is left. The .extract method removes a specific tag from the BS4 object.

For example:

for td in tds:
    td.a.extract()  # detach the <a> tag (and the <img> inside it) from the tree
    print(td)

Output (for input that also contains a <b> tag):

<td class="Thistd">Here is some<b>here is a b tag </b></td>
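If you want the remaining markup without the surrounding <td>, something like the following may work. It is only a sketch: the sample HTML with a <b> tag is my own assumption, and it uses decompose() instead of extract() because the removed tag is not needed afterwards. Looping over find_all("a") also avoids an AttributeError when a cell has no <a> at all.

```python
from bs4 import BeautifulSoup

# Hypothetical input: the <b> tag is an assumption for illustration
html = "<td class='Thistd'><a><img /></a>Here is some <b>bold</b> text.</td>"
soup = BeautifulSoup(html, "html.parser")
td = soup.find("td", class_="Thistd")

# Remove every <a> inside the cell; does nothing if there are none
for a in td.find_all("a"):
    a.decompose()

# Inner HTML of the cell: the other tags are kept, the <td> wrapper is not
inner_html = td.decode_contents()
print(inner_html)
```

Use str(td) instead of td.decode_contents() if you also want the <td> wrapper itself.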

Source

2015-10-14 06:50:27 Flickerlight

Thanks a lot, Vignesh, for improving the answer. – Flickerlight

Glad to help :) – The6thSense

Code:

from bs4 import BeautifulSoup

html = """<td class='Thistd'><a ><img /></a>Here is some text.</td>"""

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all('td', {'class': 'Thistd'})
for td in tds:
    print(td.text)  # the only change you need to make
    print('=============')

Output:

Here is some text. 
=============

Note:

.text returns only the text of the given BS4 object, which in this case is the td tag. For more information, see the official site.

Source
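One caveat: .text also collects text from nested tags. If the <a> itself contained text, that text would be included too. A possible sketch for keeping only the strings that sit directly inside the <td> is below; the "link text" inside the <a> is an assumption added to show the difference.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical input: the <a> carries its own text here (an assumption)
html = "<td class='Thistd'><a>link text<img /></a>Here is some text.</td>"
soup = BeautifulSoup(html, "html.parser")
td = soup.find("td", class_="Thistd")

# .text walks the whole subtree, so the text inside <a> is included
all_text = td.text

# Keep only the string nodes that are direct children of the <td>
direct_text = "".join(
    child for child in td.children if isinstance(child, NavigableString)
)

print(all_text)
print(direct_text)
```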

2015-10-14 06:45:50 The6thSense

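OK, thanks. But if the HTML is like "Here is some <b>here is a b tag</b><br/>and br tag text.", how can we get "Here is some <b>here is a b tag</b><br/>and br tag text."? –

So you want to get the tags as well? – The6thSense

Yes, I just don't need the <a>...</a>, but I do want the other tags. –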
