BeautifulSoup and searching by class

Possible duplicate:
Beautiful Soup cannot find a CSS class if the object has other classes, too

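I am using BeautifulSoup to find tables in HTML. The problem I am running into is spaces in the class attribute. If my HTML reads <html><table class="wikitable sortable">blah</table></html>, I can't seem to extract the table with the following (which I expected to find tables whose class is either wikitable or wikitable sortable):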

BeautifulSoup(html).findAll(attrs={'class':re.compile("wikitable(sortable)?")})

It does find the table, though, if my HTML is just <html><table class="wikitable">blah</table></html>. Likewise, I have tried using "wikitable sortable" in my regular expression, and that doesn't match either. Any ideas?


2011-05-04 cryptic_star

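The pattern match will also fail if wikitable comes after another CSS class, as in class="something wikitable other". So if you want every table whose class attribute contains the class wikitable, you need a pattern that accepts more possibilities: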

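import re
# BeautifulSoup 3 import (assumed; with bs4 use: from bs4 import BeautifulSoup)
from BeautifulSoup import BeautifulSoup

html = '''<html><table class="sortable wikitable other">blah</table>
<table class="wikitable sortable">blah</table>
<table class="wikitable"><blah></table></html>'''

tree = BeautifulSoup(html)
# Match "wikitable" as a whole word anywhere in the class attribute string.
for node in tree.findAll(attrs={'class': re.compile(r".*\bwikitable\b.*")}):
    print(node)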

Result:

<table class="sortable wikitable other">blah</table> 
<table class="wikitable sortable">blah</table> 
<table class="wikitable"><blah></blah></table>
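To see what the pattern accepts on its own, here is a minimal sketch that runs it against raw class strings, independent of BeautifulSoup. The helper has_class and the sample values are hypothetical and only for illustration. It also shows one caveat: \b treats a hyphen as a word boundary, so a token-based check is stricter than the regex.

```python
import re

# Pattern from the answer above: "wikitable" as a whole word anywhere in the string.
pattern = re.compile(r".*\bwikitable\b.*")

def has_class(class_attr, name):
    """Hypothetical helper: split the attribute on whitespace and compare tokens."""
    return name in class_attr.split()

samples = [
    "sortable wikitable other",
    "wikitable sortable",
    "wikitable",
    "wikitables",      # a different class name
    "my-wikitable",    # the hyphen counts as a word boundary for \b
]

for value in samples:
    regex_hit = pattern.match(value) is not None
    token_hit = has_class(value, "wikitable")
    print("%-28r regex=%-5s tokens=%s" % (value, regex_hit, token_hit))
```

The regex and the token check agree on the first four samples and differ only on "my-wikitable", which the regex accepts and the token check rejects.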

Just for the record, I don't use BeautifulSoup and prefer lxml, as others have mentioned.


2011-05-04 22:49:51 samplebias

Just as an update, the latest version of BeautifulSoup (bs4) handles this much more elegantly: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class – Eli 2013-07-22 20:50:28
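For reference, a minimal sketch of the bs4 approach mentioned in this comment, assuming the bs4 package is installed and using the standard-library "html.parser" backend. The sample markup (including the extra infobox table) is made up for illustration. In bs4 the class attribute is treated as a list of classes, so class_ matches any single class, and a CSS selector can require several at once.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not taken from the question.
html = """<html>
<table class="sortable wikitable other">a</table>
<table class="wikitable sortable">b</table>
<table class="wikitable">c</table>
<table class="infobox">d</table>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches when "wikitable" is any one of the element's classes.
wikitables = soup.find_all("table", class_="wikitable")

# A CSS selector requires both classes, in any order.
sortable = soup.select("table.wikitable.sortable")

print([t.get_text() for t in wikitables])
print([t.get_text() for t in sortable])
```

With this markup, the first query should return the tables a, b and c, and the second only a and b.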

One of the things that makes lxml better than BeautifulSoup is its proper support for selecting by CSS class (it even supports full CSS selectors, if you want to use them):

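import lxml.html

html = """<html>
<body>
<div class="bread butter"></div>
<div class="bread"></div>
</body>
</html>"""

tree = lxml.html.fromstring(html)

# find_class returns every element that has "bread" among its classes.
elements = tree.find_class("bread")

for element in elements:
    # encoding="unicode" yields str rather than bytes; with_tail=False drops trailing whitespace.
    print(lxml.html.tostring(element, encoding="unicode", with_tail=False))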

Gives:

<div class="bread butter"></div> 
<div class="bread"></div>


2011-05-04 22:58:45 Acorn

+1 Even though this doesn't help @allie write BeautifulSoup code, lxml is far superior. – Henry 2011-05-04 23:00:49

While I appreciate the elegance, BeautifulSoup is already in here, and for the time being, that's what I need to use. :) – 2011-05-04 23:21:18
