BeautifulSoup (BS4) parsing incorrectly

I am parsing this sample document with BS4, from Python 2.7.6:

<html> 
<body> 
<p>HTML allows omitting P end-tags. 

<p>Like that and this. 

<p>And this, too. 

<p>What happened?</p> 

<p>And can we <p>nest a paragraph, too?</p></p> 

</body> 
</html>

Using:

from bs4 import BeautifulSoup as BS 
... 
tree = BS(fh)

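HTML has, for ages, allowed end-tags to be omitted for various element types, including P (check the schema, or a parser). However, BS4's prettify() on this document shows that it does not end any of those paragraphs until it sees </BODY>: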

<html> 
<body> 
    <p> 
    HTML allows omitting P end-tags. 
    <p> 
    Like that and this. 
    <p> 
    And this, too. 
    <p> 
     What happened? 
    </p> 
    <p> 
     And can we 
     <p> 
     nest a paragraph, too? 
     </p> 
    </p> 
    </p> 
    </p> 
    </p> 
</body>

This is not the fault of prettify(), because traversing the tree manually gives me the same structure:

<[document]> 
    <html> 
     ␊ 
     <body> 
      ␊ 
      <p> 
       HTML allows omitting P end-tags.␊␊ 
       <p> 
        Like that and this.␊␊ 
        <p> 
         And this, too.␊␊ 
         <p> 
          What happened? 
         </p> 
         ␊ 
         <p> 
          And can we 
          <p> 
           nest a paragraph, too? 
          </p> 
         </p> 
         ␊ 
        </p> 
       </p> 
      </p> 
     </body> 
     ␊ 
    </html> 
    ␊ 
</[document]>

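Now, this would be the right result for XML (at least up to </BODY>, at which point it should report a well-formedness error). But this is not XML. What gives?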
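As a rough illustration of the structure expected here, the sketch below uses only Python 3's standard html.parser tokenizer and applies one simplified rule: a new <p> start tag ends a <p> that is still open. The class name ParagraphOutliner and the shortened markup are made up for this example, and the rule is a deliberate simplification of what a full HTML parser does, not BS4's or any library's actual implementation.

```python
from html.parser import HTMLParser


class ParagraphOutliner(HTMLParser):
    """Record (depth, tag) per element, closing an open <p> when a new <p> starts."""

    def __init__(self):
        super().__init__()
        self.stack = []    # names of the currently open elements
        self.outline = []  # (depth, tag) pairs in document order

    def handle_starttag(self, tag, attrs):
        # Simplified rule (an assumption of this sketch): a new <p>
        # implies the end of a <p> that is still open.
        if tag == "p" and "p" in self.stack:
            self._close("p")
        self.outline.append((len(self.stack), tag))
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Stray end tags with no matching open element are ignored.
        if tag in self.stack:
            self._close(tag)

    def _close(self, tag):
        # Pop everything up to and including the most recent `tag`.
        while self.stack.pop() != tag:
            pass


outliner = ParagraphOutliner()
outliner.feed("<body><p>one<p>two<p>three <p>four</p></p></body>")
outliner.close()

for depth, tag in outliner.outline:
    print("    " * depth + "<" + tag + ">")
```

With that rule in place every <p> is recorded at the same depth, as a sibling directly under <body>, instead of each one nesting inside the previous paragraph.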


2015-04-29 TextGeek

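The doc at http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser explains how to get BS4 to use different parsers. Apparently the default here is html.parser (when no parser is named, BS4 picks the best one installed and falls back to html.parser). The BS4 doc says html.parser is broken before Python 2.7.3, but it apparently still has the problem described above in 2.7.6.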
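One way to see where the nesting comes from is to look at the raw events that Python's standard html.parser module reports for the sample document. The sketch below is written for Python 3 (in Python 2.7 the module is named HTMLParser); the class name TagEventCounter is made up for this example. It only counts tokenizer events and says nothing about how BS4 builds its tree from them.

```python
from html.parser import HTMLParser

SAMPLE = """<html>
<body>
<p>HTML allows omitting P end-tags.

<p>Like that and this.

<p>And this, too.

<p>What happened?</p>

<p>And can we <p>nest a paragraph, too?</p></p>

</body>
</html>"""


class TagEventCounter(HTMLParser):
    """Count the start-tag and end-tag events reported for each tag name."""

    def __init__(self):
        super().__init__()
        self.starts = {}
        self.ends = {}

    def handle_starttag(self, tag, attrs):
        self.starts[tag] = self.starts.get(tag, 0) + 1

    def handle_endtag(self, tag):
        self.ends[tag] = self.ends.get(tag, 0) + 1


counter = TagEventCounter()
counter.feed(SAMPLE)
counter.close()

print("p start tags:", counter.starts.get("p", 0))  # 6
print("p end tags:", counter.ends.get("p", 0))      # 3
```

The tokenizer reports only the end tags that are literally present in the source, so a tree builder that relies on those events alone has nothing telling it to close the first three paragraphs before </body>.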

Switching to "lxml" failed for me, but switching to "html5lib" produces the correct result:

tree = BS(htmSource, "html5lib")


2015-05-06 17:28:46 TextGeek
