Cleaning an HTML document with multiple paragraphs using BeautifulSoup

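I have an HTML document, and I can use BeautifulSoup to get its elements and extract their text. My problem is that when I use the getText() method to get the body of the document, it merges all the paragraphs and returns a single line. I have tried different ways to get the paragraphs separately, but without success. The format of the document is: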

<div class="body" style="text-align: justify;padding: 20px;"> <div align="justify"><span style="font-weight: bold; color: rgb(128, 0, 0);"><img style="border: medium none; margin-left: 10px;" alt="" title="" src="/files/7/7/86119_216.jpg" align="right">ABC-</span>Paragraph 1<br><br>Paragraph 2<br><br>Paragraph 3<br><br><span style="font-weight: bold;">Paragraph 4</span><br>Paragraph 5 <span style="font-weight: bold; font-style: italic; text-decoration: underline; color: rgb(128, 0, 0);">Paragraph 6</span>Paragraph <br><br>Paragraph</div> <div class="wrapper"></div> </div> </div>

I currently get the body of this document with:

from bs4 import BeautifulSoup

soup = BeautifulSoup(page)
body = soup.find("div", {"class":"body"})

Up to this point everything works. My question now is how to get the individual paragraphs within the body. Any ideas?

While trying to process another HTML file, I ran into a different problem extracting paragraphs. The format of this new page is:

<div class="detailCont"> 
    <span>News agency:</span> 
    <h2> 
     Header 

    </h2> 
     <div> 
      <img class="showNewsImg" src="http://images.agency.com/images/position36/2013/9/khrid_hvapyma-910407-as.jpg" /> 
     </div> 

    <div class="lead"> 
     <span>additional info</span>- 
     agency:<br />Paragraph 1 
    </div> 

    <p>Paragraph 2</p> 
    <p>Paragraph 3</p> 
    <p>Paragraph 4</p> 
    <p>Paragraph 5</p> 
    </div>

All the data I need is in this section, so I can get it with the following command:

doc = soup.find("div", {"class":"detailCont"})

It contains both the header and the body. To get the header, I use the following command:

header = doc.h2

However, I don't know how to get just the body. Any ideas? Best regards.
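One possible way to pull only the body paragraphs out of the second page is sketched below. It is not taken from the thread: it assumes bs4 with the built-in "html.parser", uses a shortened copy of the markup above, and assumes the first paragraph is always the text that follows the <br /> inside div.lead. The names first, rest and paragraphs are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the second page from the question (illustrative only).
page = """
<div class="detailCont">
    <span>News agency:</span>
    <h2>
     Header
    </h2>
    <div class="lead">
     <span>additional info</span>-
     agency:<br />Paragraph 1
    </div>
    <p>Paragraph 2</p>
    <p>Paragraph 3</p>
</div>
"""

soup = BeautifulSoup(page, "html.parser")
doc = soup.find("div", {"class": "detailCont"})

# Header text without the surrounding whitespace.
header = doc.h2.get_text(strip=True)

# Assumption: the first paragraph is the text node right after <br /> in div.lead.
lead = doc.find("div", {"class": "lead"})
first = lead.br.next_sibling.strip()

# The remaining paragraphs are ordinary <p> elements.
rest = [p.get_text(strip=True) for p in doc.find_all("p")]

paragraphs = [first] + rest
print(header)
print(paragraphs)
```

If the real pages do not always place the first paragraph after a <br />, the lookup for first would need a different rule.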

Source

2013-09-29 amin

'' is not, strictly speaking, a paragraph separator. – tripleee

Use a list comprehension:

[s for s in body.strings if s.strip() != '']

It yields:

['ABC-', 
'Paragraph 1', 
'Paragraph 2', 
'Paragraph 3', 
'Paragraph 4', 
'Paragraph 5 ', 
'Paragraph 6', 
'Paragraph ', 
'Paragraph']
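For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. It assumes bs4 with the built-in "html.parser" and uses a shortened, illustrative version of the first page, not the full markup from the question. It also shows stripped_strings, which skips whitespace-only strings and trims the rest.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative version of the first page from the question.
page = (
    '<div class="body"> <div align="justify">'
    '<span style="font-weight: bold;">ABC-</span>Paragraph 1<br><br>'
    'Paragraph 2<br><br>'
    '<span style="font-weight: bold;">Paragraph 4</span><br>Paragraph 5 '
    '</div> <div class="wrapper"></div> </div>'
)

soup = BeautifulSoup(page, "html.parser")
body = soup.find("div", {"class": "body"})

# Same list comprehension as above: keep every non-blank text node as-is.
paragraphs = [s for s in body.strings if s.strip() != ""]

# Variant: stripped_strings drops blank strings and trims whitespace.
trimmed = list(body.stripped_strings)

print(paragraphs)
print(trimmed)
```

The only difference between the two lists is trailing whitespace, such as the space after "Paragraph 5".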

Source

2013-09-29 21:00:07 Birei

Thanks, dear Birei, it works. ;-) – amin
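Note that inline tags such as <span> split the text of one visual paragraph into several strings, which is why the list in the answer has separate entries for 'Paragraph 5 ', 'Paragraph 6' and 'Paragraph '. A rough sketch of a different technique, grouping all text between <br> tags into one entry, could look like the following. The helper name paragraphs_by_br is hypothetical, the markup is a shortened version of the first page, and bs4 with "html.parser" is assumed.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

def paragraphs_by_br(container):
    """Join the text found between <br> tags into one string per group."""
    paragraphs, current = [], []
    for node in container.descendants:
        if isinstance(node, Tag) and node.name == "br":
            # A <br> closes the current group; empty groups are dropped.
            text = "".join(current).strip()
            if text:
                paragraphs.append(text)
            current = []
        elif isinstance(node, NavigableString):
            # Note: this also picks up HTML comments, if the page has any.
            current.append(str(node))
    # Flush whatever follows the last <br>.
    text = "".join(current).strip()
    if text:
        paragraphs.append(text)
    return paragraphs

# Shortened, illustrative version of the first page from the question.
page = (
    '<div class="body"><div align="justify">'
    'Paragraph 1<br><br>'
    '<span style="font-weight: bold;">Paragraph 4</span><br>'
    'Paragraph 5 <span style="font-style: italic;">Paragraph 6</span>Paragraph '
    '<br><br>Paragraph</div></div>'
)

soup = BeautifulSoup(page, "html.parser")
body = soup.find("div", {"class": "body"})
result = paragraphs_by_br(body)
print(result)
```

Whether a single <br> or only a double <br><br> should end a paragraph depends on the page; this sketch treats every <br> as a break.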
