Extracting the text of specific TD table elements with BeautifulSoup?

I'm trying to use the BeautifulSoup library and I'm having some trouble extracting IP addresses from an automatically generated HTML table.

The HTML is structured like this:

<html> 
<body> 
    <table class="mainTable"> 
    <thead> 
     <tr> 
      <th>IP</th> 
      <th>Country</th> 
     </tr> 
    </thead> 
    <tbody> 
     <tr> 
      <td><a href="hello.html">127.0.0.1<a></td> 
      <td><img src="uk.gif" /><a href="uk.com">uk</a></td> 
     </tr> 
     <tr> 
      <td><a href="hello.html">192.168.0.1<a></td> 
      <td><img src="uk.gif" /><a href="us.com">us</a></td> 
     </tr> 
     <tr> 
      <td><a href="hello.html">255.255.255.0<a></td> 
      <td><img src="uk.gif" /><a href="br.com">br</a></td> 
     </tr> 
    </tbody> 
</table>

The short snippet below extracts the text from both TD cells of each row, but I only need the IP data, not the IP and country data:

from bs4 import BeautifulSoup 
soup = BeautifulSoup(open("data.htm")) 

table = soup.find('table', {'class': 'mainTable'}) 
for row in table.findAll("a"):
    print(row.text)

This outputs:

127.0.0.1 
uk 
192.168.0.1 
us 
255.255.255.0 
br

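What I need is the text of the IP table.tbody.tr.td.a elements, not the country table.tbody.tr.td.img.a elements.

Would any experienced BeautifulSoup users have an idea of how to do this selection and extraction?

Thanks.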


2014-03-30 Pike Man

Search for just the first <td> of each row in the tbody:

import bs4

# html should contain the page content:
[row.find('td').getText() for row in bs4.BeautifulSoup(html).find('tbody').find_all('tr')]

Perhaps more readable:

rows = bs4.BeautifulSoup(html).find('tbody').find_all('tr')
iplist = [row.find('td').getText() for row in rows]
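For anyone who wants to try the first-cell approach without a data file, here is a minimal self-contained sketch. The inline `HTML` string is a shortened copy of the table from the question, and the function name `first_column_text` is made up for this illustration. It passes `"html.parser"` explicitly, which is an assumption on my part, so the result does not depend on which parser happens to be installed.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's table, including the unclosed <a> tags.
HTML = """
<table class="mainTable">
  <thead><tr><th>IP</th><th>Country</th></tr></thead>
  <tbody>
    <tr><td><a href="hello.html">127.0.0.1<a></td>
        <td><img src="uk.gif" /><a href="uk.com">uk</a></td></tr>
    <tr><td><a href="hello.html">192.168.0.1<a></td>
        <td><img src="uk.gif" /><a href="us.com">us</a></td></tr>
  </tbody>
</table>
"""


def first_column_text(html):
    """Return the text of the first <td> of every row in the table body."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="mainTable")
    rows = table.find("tbody").find_all("tr")
    # row.find("td") returns only the first cell of the row (or None).
    cells = (row.find("td") for row in rows)
    return [cell.get_text(strip=True) for cell in cells if cell is not None]


ips = first_column_text(HTML)
print(ips)
```

Because only the first cell of each row is read, the country links are never visited, so no filtering on the link text is needed.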


2014-03-30 16:03:52

This gives you the right list:

>>> pred = lambda tag: tag.parent.find('img') is None 
>>> list(filter(pred, soup.find('tbody').find_all('a'))) 
[<a href="hello.html">127.0.0.1<a></a></a>, <a></a>, <a href="hello.html">192.168.0.1<a></a></a>, <a></a>, <a href="hello.html">255.255.255.0<a></a></a>, <a></a>]

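Just apply .text to the elements of this list.

The list above contains several empty <a></a> tags because the <a> tags in the HTML are not closed properly. To get rid of them, you can use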

pred = lambda tag: tag.parent.find('img') is None and tag.text

and finally:

>>> [tag.text for tag in filter(pred, soup.find('tbody').find_all('a'))] 
['127.0.0.1', '192.168.0.1', '255.255.255.0']


2014-03-30 16:34:54

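Nice approach and a useful solution.

You can use a small regular expression to extract the IP addresses. BeautifulSoup together with regular expressions is a good combination for scraping.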

import re

ip_pat = re.compile(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$")
for row in table.findAll("a"):
    if ip_pat.match(row.text):
        print(row.text)
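One thing worth knowing about the pattern above: it checks the shape of the string only, so something like 999.1.1.1 would also pass. If that matters, a stricter check is possible with the standard library's `ipaddress` module. The sketch below is a swapped-in alternative, not part of the answer above. The helper name `only_ipv4` and the sample list `link_texts` are hypothetical, and the list stands in for the texts collected from the `<a>` tags.

```python
import ipaddress


def only_ipv4(texts):
    """Keep only the strings that parse as valid IPv4 addresses."""
    result = []
    for text in texts:
        candidate = text.strip()
        try:
            # Raises a ValueError subclass for anything that is not IPv4.
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue
        result.append(candidate)
    return result


# Hypothetical sample: link texts as they might come out of the table.
link_texts = ["127.0.0.1", "uk", "192.168.0.1", "us", "999.1.1.1", "255.255.255.0"]
valid = only_ipv4(link_texts)
print(valid)
```

In the loop from the answer, this would amount to calling the same check on `row.text` in place of `ip_pat.match`.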


2014-03-30 16:54:31

First, note that the HTML is not well-formed: it does not close the a tags. Two <a> tags are opened here:

<a href="hello.html">127.0.0.1<a>

If you print table, you will see that BeautifulSoup parses the HTML as:

<td> 
<a href="hello.html">127.0.0.1</a><a></a> 
</td> 
...

Each a is followed by an empty a.

Given the presence of these extra <a> tags, if you want every third <a> tag, then

for row in table.findAll("a")[::3]: 
    print(row.get_text())

suffices:

127.0.0.1 
192.168.0.1 
255.255.255.0

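On the other hand, if the <a> tags do not occur that regularly, and you only want those <a> tags that have no previous sibling (such as, but not limited to, <img>), then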

for row in table.findAll("a"):
    sibling = row.findPreviousSibling()
    if sibling is None:
        print(row.get_text())

will work.
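How the unclosed <a> tags end up in the tree depends on the parser BeautifulSoup uses, so the sibling test may behave differently from one setup to another. Below is a hedged, self-contained sketch of a variant that should be less sensitive to that. It checks specifically for a preceding <img> sibling (as the XPath version of the criterion does) and additionally skips links with empty text. The function name `links_without_img_sibling`, the inline `HTML` sample, and the choice of `"html.parser"` are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's table, including the unclosed <a> tags.
HTML = """
<table class="mainTable">
  <tbody>
    <tr><td><a href="hello.html">127.0.0.1<a></td>
        <td><img src="uk.gif" /><a href="uk.com">uk</a></td></tr>
    <tr><td><a href="hello.html">255.255.255.0<a></td>
        <td><img src="uk.gif" /><a href="br.com">br</a></td></tr>
  </tbody>
</table>
"""


def links_without_img_sibling(html):
    """Return the text of every non-empty <a> that no <img> sibling precedes."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="mainTable")
    texts = []
    for link in table.find_all("a"):
        if link.find_previous_sibling("img") is not None:
            continue  # country link: an <img> precedes it in the same cell
        text = link.get_text(strip=True)
        if text:  # skip the empty <a> produced by the unclosed tags
            texts.append(text)
    return texts


ip_texts = links_without_img_sibling(HTML)
print(ip_texts)
```

The empty-text check is what takes care of the stray <a> tags, whether the parser nests them inside the IP link or places them next to it.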

If you have lxml, the criterion can be expressed more succinctly using XPath:

import lxml.html as LH 
doc = LH.parse("data.htm") 
ips = doc.xpath('//table[@class="mainTable"]//td/a[not(preceding-sibling::img)]/text()') 
print(ips)

The XPath used above has the following meaning:

//table                               select all <table> tags
    [@class="mainTable"]              that have a class="mainTable" attribute
//                                    from these tags select descendants
    td/a                              which are td tags with a child <a> tag
    [not(preceding-sibling::img)]     such that it does not have a preceding sibling <img> tag
    /text()                           return the text of the <a> tag

It takes a little time to learn XPath, but once you understand it, you may never want to use BeautifulSoup again.


2014-03-30 17:00:41 unutbu
