Python BeautifulSoup: find_all with re.compile for a set of tags

Here is what my HTML data looks like:

<td>4.2.2</td>, 
<td align="center"><a href="https://blah.org/blah-4.2.2.zip">zip</a> (<a href="https://blah.org/blah-4.2.2.zip.md5">md5</a> | <a href="https://blah.org/blah-4.2.2.zip.sha1">sha1</a>)</td>, 
<td align="center"><a href="https://blah.org/blah-.2.2.tar.gz">tar.gz</a> (<a href="https://blah.org/blah-4.2.2.tar.gz.md5">md5</a>|<ahref="https://blah.org/blah-4.2.2.tar.gz.sha1">sha1</a>)</td>, 
<td align="center"><a href="https://blah.org/blah-4.2.2-IIS.zip">IISzip</a> (<a href="https://blah.org/blah-4.2.2-IIS.zip.md5">md5</a> | <a href="https://blah.org/blah-4.2.2-IIS.zip.sha1">sha1</a>)</td>, 
<td>4.2.1</td>, 
<td align="center"><a href="https://blah.org/blah-4.2.1.zip">zip</a> (<a href="https://blah.org/blah-4.2.1.zip.md5">md5</a> | <a href="https://blah.org/blah-4.2.1.zip.sha1">sha1</a>)</td>, 
<td align="center"><a href="https://blah.org/blah-4.2.1.tar.gz">tar.gz</a> (<a href="https://blah.org/blah-4.2.1.tar.gz.md5">md5</a> | <a href="https://blah.org/blah-4.2.1.tar.gz.sha1">sha1</a>)</td>, 
<td align="center"><a href="https://blah.org/blah-4.2.1-IIS.zip">IIS zip</a> (<a href="https://blah.org/blah-4.2.1-IIS.zip.md5">md5</a> | <a href="https://blah.org/blah-4.2.1-IIS.zip.sha1">sha1</a>)</td>, 
<td>4.2</td> 
<td>1.0-platinum</td>

and so on.

I want to iterate over this page and pull out only the version numbers, which sit inside plain

<td>4.2.2</td>

tags. For example:

4.2.2

4.2.1

4.2

1.0-platinum

So far I have tried:

for tag in html.find_all('tbody', limit=1, string=re.compile("\<td\>(.*?)\<\/td\>")):
    print(tag.content)

and

rpart=html.find('tbody')
for tds in rpart.find_all('td'):
    print(tds.find_all('\<td\>(.*?)\<\/td>'))

and

results=rpart.find_all('td', tds=re.compile("\<td\>(.*?)\<\/td\>"))

and

wphtml.find('tbody').find_all('td', tds=re.compile('\<td\>(.*?)\<\/td\>'))

and

for p in rpart.find_all('td', digits=re.compile('\<td\>(.*?)\<\/td\>')):
    print(p.contents)

None of these gives me the version numbers.

I also noticed that rpart is of type "ResultSet", so I am willing to bet that I am missing something small. What in God's name am I doing wrong?

Source

2015-07-21 metallica1973

-1

The correct regular expression is <td>(\d+(?:\.\d+)*)</td>. With re.findall you do not need BeautifulSoup:

import re 
html = """ 
<td>4.2.2</td>, 
<td align="center"><a href="https://blah.org/blah-4.2.2.zip">zip</a> (<a href="https://blah.org/blah-4.2.2.zip.md5">md5</a> | <a href="https://blah.org/blah-4.2.2.zip.sha1">sha1</a>)</td>, 
<td align="center"><a href="https://blah.org/blah-.2.2.tar.gz">tar.gz</a> (<a href="https://blah.org/blah-4.2.2.tar.gz.md5">md5</a>|<ahref="https://blah.org/blah-4.2.2.tar.gz.sha1">sha1</a>)</td>, 
<td align="center"><a href="https://blah.org/blah-4.2.2-IIS.zip">IISzip</a> (<a href="https://blah.org/blah-4.2.2-IIS.zip.md5">md5</a> | <a href="https://blah.org/blah-4.2.2-IIS.zip.sha1">sha1</a>)</td>, 
<td>4.2.1</td>, 
<td align="center"><a href="https://blah.org/blah-4.2.1.zip">zip</a> (<a href="https://blah.org/blah-4.2.1.zip.md5">md5</a> | <a href="https://blah.org/blah-4.2.1.zip.sha1">sha1</a>)</td>, 
<td align="center"><a href="https://blah.org/blah-4.2.1.tar.gz">tar.gz</a> (<a href="https://blah.org/blah-4.2.1.tar.gz.md5">md5</a> | <a href="https://blah.org/blah-4.2.1.tar.gz.sha1">sha1</a>)</td>, 
<td align="center"><a href="https://blah.org/blah-4.2.1-IIS.zip">IIS zip</a> (<a href="https://blah.org/blah-4.2.1-IIS.zip.md5">md5</a> | <a href="https://blah.org/blah-4.2.1-IIS.zip.sha1">sha1</a>)</td>, 
<td>4.2</td> 
""" 
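# Only the bare <td> cells match; cells with align="center" are skipped.
print(re.findall(r"<td>(\d+(?:\.\d+)*)</td>", html))  # ['4.2.2', '4.2.1', '4.2']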

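Source

2015-07-21 20:36:51

Thank you very much, but unfortunately I am stuck using BeautifulSoup. I forgot to add to my original post that some of the text in the td tags contains letters, which is why I wrote my regular expression that way to capture it. – metallica1973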

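First, a space is missing in the last tag on the third line (<ahref=...> instead of <a href=...>). This may cause problems when parsing with BeautifulSoup.

There are two ways to extract the values easily from the text you provided:

BeautifulSoup:

from bs4 import BeautifulSoup

# htmlString holds the page markup; the version cells are the <td> tags without an align attribute.
html = BeautifulSoup(htmlString, 'html.parser')
for tag in html.find_all('td', align=None):
    print(tag.string)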
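For readers who want to try the BeautifulSoup approach end to end, here is a minimal, self-contained sketch. It assumes the bs4 package is installed, and SAMPLE_HTML and version_cells are hypothetical names made up for this illustration; the sample markup is a shortened version of the rows in the question. Instead of relying on how a None attribute value is handled, it filters in plain Python with Tag.has_attr.

```python
from bs4 import BeautifulSoup

# Hypothetical sample, shortened from the rows shown in the question.
SAMPLE_HTML = """
<table><tbody><tr>
<td>4.2.2</td>
<td align="center"><a href="https://blah.org/blah-4.2.2.zip">zip</a></td>
</tr><tr>
<td>4.2.1</td>
<td align="center"><a href="https://blah.org/blah-4.2.1.zip">zip</a></td>
</tr><tr>
<td>1.0-platinum</td>
</tr></tbody></table>
"""


def version_cells(markup):
    """Return the text of every <td> that has no align attribute."""
    soup = BeautifulSoup(markup, "html.parser")
    return [
        td.get_text(strip=True)
        for td in soup.find_all("td")
        if not td.has_attr("align")  # download cells carry align="center"
    ]


versions = version_cells(SAMPLE_HTML)
print(versions)
```

This depends on the version cells being the only td tags without an align attribute, which holds for the sample but may not hold for the real page.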

Pure regular expression (no BeautifulSoup):

import re

# Capture whatever sits between a bare <td> and its closing tag.
for val in re.findall(r'<td>(.*?)</td>', htmlString):
    print(val)
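The question also mentions cells such as 1.0-platinum, where the version carries a suffix made of letters. As a rough sketch using only the standard library, a slightly stricter pattern can accept digits and dots followed by an optional hyphenated suffix. SAMPLE_HTML, VERSION_CELL and find_versions are hypothetical names for this illustration, and the suffix rule is an assumption about what the real data looks like.

```python
import re

# Hypothetical sample, shortened from the rows shown in the question.
SAMPLE_HTML = """
<td>4.2.2</td>,
<td align="center"><a href="https://blah.org/blah-4.2.2.zip">zip</a></td>,
<td>4.2</td>
<td>1.0-platinum</td>
"""

# Digits separated by dots, optionally followed by "-" and a word (assumed suffix format).
VERSION_CELL = re.compile(r"<td>(\d+(?:\.\d+)*(?:-\w+)?)</td>")


def find_versions(markup):
    """Return the version strings found in bare <td> cells."""
    return VERSION_CELL.findall(markup)


versions = find_versions(SAMPLE_HTML)
print(versions)
```

Because the pattern requires a literal <td> with no attributes, the download cells with align="center" never match.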

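As far as I can tell, find_all searches by tag name, so a compiled regular expression passed as its first argument is matched against tag names, not against the markup. For example, if you want to find all "tbody" and "td" tags, you could do:

# The anchors keep the pattern from matching other tag names that merely contain "td".
for tag in html.find_all(re.compile(r'^t(d|body)$')):
    print(tag.string)

From the tags that are found, you can then access the attributes, or the value/string between the opening and closing tags. I have not yet found a way to use BeautifulSoup to find tags by their value/string.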
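One option that may cover searching by text is the string argument of find_all, which the BeautifulSoup documentation describes (it was called text before version 4.4.0). The sketch below assumes bs4 4.4.0 or later; SAMPLE_HTML and starts_with_version are hypothetical names for this illustration. When a tag name and string are given together, a tag is kept only if its .string matches, and a td with several children has .string equal to None, so the download cells drop out.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample, shortened from the rows shown in the question.
SAMPLE_HTML = """
<table><tbody><tr>
<td>4.2.2</td>
<td align="center"><a href="https://blah.org/blah-4.2.2.zip">zip</a> (<a href="https://blah.org/blah-4.2.2.zip.md5">md5</a>)</td>
</tr><tr>
<td>1.0-platinum</td>
</tr></tbody></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The pattern is searched against each td's .string, not against the raw markup.
starts_with_version = re.compile(r"^\d+(\.\d+)*")
matching = soup.find_all("td", string=starts_with_version)

texts = [str(td.string) for td in matching]
print(texts)
```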

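Here is a reference with a couple of examples, in case it helps: BeautifulSoup Documentation - A Regular Expression

Also, in BeautifulSoup the re.compile passed to find_all is used for filtering/matching, not for capture groups. In other words, the regular expression is only a pattern that a tag either matches or does not match, so you cannot use (.*?) to extract part of the value. Keyword arguments that find_all does not recognize, such as tds= or digits=, are treated as filters on attributes with those names, so they never look at the tag text.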
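To make the filter-versus-capture distinction concrete, the following sketch separates the two steps: find_all decides which tags to keep, and a second call on the same compiled pattern does the capturing. It assumes bs4 4.4.0 or later for the string argument; SAMPLE_HTML, VERSION and split_versions are hypothetical names, and the suffix format is an assumption about the data.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical one-row sample with one non-version cell.
SAMPLE_HTML = "<table><tr><td>4.2.2</td><td>1.0-platinum</td><td>n/a</td></tr></table>"

# Group 1: dotted number. Group 2: optional suffix after a hyphen (assumed format).
VERSION = re.compile(r"^(\d+(?:\.\d+)*)(?:-(\w+))?$")


def split_versions(markup):
    """Return (number, suffix) pairs for every td whose text looks like a version."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    # Filter step: find_all returns Tag objects and ignores the capture groups.
    for td in soup.find_all("td", string=VERSION):
        # Capture step: apply the pattern ourselves to read the groups.
        match = VERSION.match(td.string)
        results.append((match.group(1), match.group(2)))
    return results


pairs = split_versions(SAMPLE_HTML)
print(pairs)
```

Running the pattern twice is slightly redundant, but it keeps the selection logic inside BeautifulSoup and the extraction logic in plain re, which mirrors how the library treats the pattern.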

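Source

2015-07-21 20:54:57

Awesome. I am still wondering why none of my approaches using re.compile and BeautifulSoup worked. – metallica1973

My comment was too long, so I updated the answer above to address your question in more detail. Hope that helps. –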
