BeautifulSoup: parsing a page

I want to parse a fragment of an HTML page, say:

my_string = """ 
<p>Some text. Some text. Some text. Some text. Some text. Some text. 
    <a href="#">Link1</a> 
    <a href="#">Link2</a> 
</p> 
<img src="image.png" /> 
<p>One more paragraph</p> 
"""

I pass this string to BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(my_string)
# add rel="nofollow" to <a> tags
# return comment to the template
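The two commented steps could look roughly like the sketch below. It assumes bs4 with the built-in html.parser backend; the function name add_nofollow and the shortened sample string are only illustrative, not part of the original code.

```python
from bs4 import BeautifulSoup

my_string = """
<p>Some text.
    <a href="#">Link1</a>
    <a href="#">Link2</a>
</p>
<img src="image.png" />
<p>One more paragraph</p>
"""


def add_nofollow(markup):
    """Return the markup with rel="nofollow" set on every <a> tag."""
    # html.parser is assumed here; other parsers may wrap the fragment
    soup = BeautifulSoup(markup, "html.parser")
    for link in soup.find_all("a"):
        link["rel"] = "nofollow"
    return str(soup)


# the string that would be handed to the template
comment = add_nofollow(my_string)
```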

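However, while parsing, BeautifulSoup adds <html>, <head> and <body> tags (when using the lxml or html5lib parsers), and I don't need them. The only way I have found so far to avoid this is to use html.parser.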
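To see which wrapper tags a given parser adds, a small check along these lines may help. The helper name wrappers_added is hypothetical. Only html.parser is exercised here, because lxml and html5lib are separate installs.

```python
from bs4 import BeautifulSoup


def wrappers_added(markup, parser):
    """List which of <html>, <head>, <body> are present in the parsed tree."""
    soup = BeautifulSoup(markup, parser)
    return [name for name in ("html", "head", "body") if soup.find(name) is not None]


fragment = "<p>One more paragraph</p>"
print(wrappers_added(fragment, "html.parser"))
# pass "lxml" or "html5lib" instead (if installed) to compare the parsers
```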

Is there a way to get rid of the redundant tags while still using lxml, the fastest parser?

UPDATE

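My original question was asked incorrectly. I have now removed the <div> wrapper from my example, because ordinary users do not use this tag. For this reason, we cannot use the .extract() method to get rid of the <html>, <head> and <body> tags.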

2012-06-30 Vlad T.

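Have you tried using MinimalSoup instead of BeautifulSoup? (Same library, different constructor.) It should be less strict about this kind of thing.

I tried it, but I don't understand how it works.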

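I was able to solve the problem using the .contents property:

def body_contents(soup):  # wrapper name is illustrative; the snippet needs a function to return from
    try:
        children = soup.body.contents
        string = ''
        for child in children:
            string += str(child)
        return string
    except AttributeError:
        # no <body> was added, so the soup is already the bare fragment
        return str(soup)

I think ''.join(soup.body.contents) would be a neater way to convert the list to a string, but it does not work; I get:

TypeError: sequence item 0: expected string, Tag found

2012-07-11 22:39:52

lxml will always add these tags, but you can use Tag.extract() to remove your <div> tag from inside them:

comment = soup.body.div.extract()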
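A minimal sketch of the extract() approach follows. It only applies when the input really has a wrapping <div>, which the question's UPDATE says is not the case for ordinary users. The helper name extract_wrapper and the sample markup are assumptions for illustration. The fallback to the soup itself covers parsers that add no <body>.

```python
from bs4 import BeautifulSoup

wrapped = '<div><p>Some text. <a href="#">Link1</a></p><p>One more paragraph</p></div>'


def extract_wrapper(markup, parser="html.parser"):
    """Detach the first <div> from the tree and return it as a standalone Tag."""
    soup = BeautifulSoup(markup, parser)
    # lxml/html5lib place the div under <body>; html.parser leaves it at the top level
    root = soup.body if soup.body is not None else soup
    return root.div.extract()


wrapper = extract_wrapper(wrapped)
inner = wrapper.decode_contents()  # markup inside the div, without the div itself
```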

2012-07-01 15:19:50

Use

soup.body.renderContents()
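In bs4, renderContents() is, as far as I know, a legacy name that returns bytes; decode_contents() gives the same inner markup as a str. The sketch below uses it, and also shows a join that avoids the "expected string, Tag found" error by converting each child with str() first. The function names are hypothetical, and the fallback to the soup itself is an assumption for parsers that add no <body>.

```python
from bs4 import BeautifulSoup


def inner_html(soup):
    """Serialize the children of <body>, or of the whole soup if there is no <body>."""
    container = soup.body if soup.body is not None else soup
    return container.decode_contents()


def inner_html_joined(soup):
    """Same idea with an explicit join over the children."""
    container = soup.body if soup.body is not None else soup
    # .contents holds Tag and string objects, so each one is converted first;
    # direct text children are joined as plain text, so decode_contents()
    # is likely the safer choice if they may contain & or <
    return "".join(str(child) for child in container.contents)


doc = BeautifulSoup(
    "<html><body><p>One</p><p>Two</p></body></html>", "html.parser"
)
print(inner_html(doc))
```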

2012-12-05 09:22:00
