Parsing HTML with lxml (Python)

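I am trying to save the content of an HTML page to a .html file, but I only want to keep the content under the "table" tag. In addition, I would like to remove all empty tags such as <b></b>. I have already done all of this with BeautifulSoup: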

import urllib2
from bs4 import BeautifulSoup

f = urllib2.urlopen('http://test.xyz')
html = f.read()
f.close()
soup = BeautifulSoup(html)

# collect the HTML of every <table class="main">
txt = ""
for table in soup.find_all("table", {'class': 'main'}):
    txt += str(table)

# re-parse the collected tables and remove the empty <b> tags
text = BeautifulSoup(txt)
empty_tags = text.find_all(lambda tag: tag.name == 'b' and tag.find(True) is None and (tag.string is None or tag.string.strip() == ""))
for empty_tag in empty_tags:
    empty_tag.extract()

My question is: is this also possible with lxml? If so, roughly how would it look? Thank you very much for your help.

Source

2013-08-25 MarkF6

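`tables = lxml.html.parse('http://test.xyz').getroot().cssselect('table.main')` will get you the `<table>` elements with class "main". `[lxml.html.tostring(t, method="html", encoding=unicode) for t in tables]` will get you the HTML content (`method="text"` will give you the text content without tags). Which empty tags do you want to exclude? –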

Thanks for your reply. Empty tags are simply tags without content, for example: – MarkF6

OK. See my answer. –
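To illustrate the difference between `method="html"` and `method="text"` mentioned in the comment above, here is a minimal sketch. It parses an inline string instead of downloading a page; the `SAMPLE` markup and the variable names are assumptions made for the example, not part of the original thread.

```python
import lxml.html

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE = '<div><table class="main"><tr><td>Hello <b>world</b></td></tr></table></div>'

root = lxml.html.fromstring(SAMPLE)
table = root.xpath('//table[@class="main"]')[0]

# method="html" keeps the markup; encoding="unicode" returns text, not bytes.
as_html = lxml.html.tostring(table, method="html", encoding="unicode")

# method="text" keeps only the text nodes and drops every tag.
as_text = lxml.html.tostring(table, method="text", encoding="unicode")

print(as_html)
print(as_text)
```

The XPath form is used here only so the sketch does not depend on the separate `cssselect` package; `root.cssselect('table.main')` should select the same element when that package is installed.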

Answer

import lxml.html 

# lxml can download pages directly 
root = lxml.html.parse('http://test.xyz').getroot() 

# use a CSS selector for class="main", 
# or use root.xpath('//table[@class="main"]') 
tables = root.cssselect('table.main') 

# extract HTML content from all tables
# (encoding="unicode" makes tostring return text instead of bytes)
# use method="text" instead to get text content without tags
html_content = "\n".join(lxml.html.tostring(t, encoding="unicode") for t in tables)

# removing only specific empty tags, here <b></b> and <i></i> 
for empty in root.xpath('//*[self::b or self::i][not(node())]'): 
    empty.getparent().remove(empty) 

# removing all empty tags (tags that do not have children nodes) 
for empty in root.xpath('//*[not(node())]'): 
    empty.getparent().remove(empty) 
# root does not contain those empty tags anymore
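One thing to watch with the removal loops above: in lxml, `getparent().remove(element)` also discards the element's tail, that is, any text that follows the element before the next tag. The sketch below is one possible variation, not the answerer's code. It uses `drop_tree()`, which `lxml.html` elements provide and which keeps the tail text. It also treats whitespace-only tags as empty, as the BeautifulSoup version in the question does. The `SAMPLE` markup and the `is_blank` helper are hypothetical names introduced for illustration.

```python
import lxml.html

# Hypothetical sample markup; note the text that follows each empty tag.
SAMPLE = (
    '<table class="main"><tr>'
    '<td>Hello <b></b>world<i> </i>!</td>'
    '<td><b>kept</b></td>'
    '</tr></table>'
)

root = lxml.html.fromstring(SAMPLE)


def is_blank(element):
    """True if the element has no children and no non-whitespace text."""
    return len(element) == 0 and not (element.text or "").strip()


# Materialise the list first so the tree is not modified while iterating.
for element in list(root.iter("b", "i")):
    if is_blank(element):
        # drop_tree() removes the element but keeps its tail text,
        # whereas getparent().remove() would discard the tail as well.
        element.drop_tree()

cleaned = lxml.html.tostring(root, encoding="unicode")
print(cleaned)
```

Restricting the loop to `b` and `i` also avoids a side effect of the catch-all `//*[not(node())]` expression, which matches elements such as `<br>` and `<img>` that are empty by design.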

Source

2013-08-25 22:52:54

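Thank you very much for this reply! :) Is it possible to remove only specific empty tags (e.g. b)? And is it possible to replace leftovers such as "&amp" with ""? – MarkF6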

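I edited the answer with an expression that removes only specific empty tags. To remove "&amp", you are probably better off using a regular expression such as `re.sub("&[^\s;]+\s", "", mystring)` (may need further testing) –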
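As a rough sketch of the regular-expression approach suggested in the comment above, the snippet below assumes the pattern is `&[^\s;]+\s`: an ampersand, one or more characters that are neither whitespace nor `;`, then one whitespace character. The pattern, the `BROKEN_ENTITY` name and the `strip_broken_entities` helper are assumptions for illustration only.

```python
import re

# Assumed pattern: "&", then one or more characters that are neither
# whitespace nor ";", then a single whitespace character.
BROKEN_ENTITY = re.compile(r"&[^\s;]+\s")


def strip_broken_entities(text):
    """Remove entity-like tokens that lack the terminating semicolon."""
    return BROKEN_ENTITY.sub("", text)


print(strip_broken_entities("Tom &amp Jerry"))   # unterminated entity is removed
print(strip_broken_entities("Tom &amp; Jerry"))  # well-formed entity is left alone
```

This pattern is blunt: it also removes ordinary text that happens to follow an ampersand, so "AT&T rocks" would become "ATrocks". It also leaves an unterminated entity at the very end of the string untouched, which is in line with the comment's caveat that further testing may be needed.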

Thank you very much! :) This works fine :) – MarkF6
