2015-04-29

BeautifulSoup (bs4) parses incorrectly

Parsing this sample document with bs4, under Python 2.7.6:

<html> 
<body> 
<p>HTML allows omitting P end-tags. 

<p>Like that and this. 

<p>And this, too. 

<p>What happened?</p> 

<p>And can we <p>nest a paragraph, too?</p></p> 

</body> 
</html> 

Using:

from bs4 import BeautifulSoup as BS 
... 
tree = BS(fh) 

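HTML has long allowed end tags to be omitted for various element types, including P (check the schema, or a parser). However, bs4's prettify() on this document shows that it does not end any of these paragraphs until it sees </body>: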

<html> 
<body> 
    <p> 
    HTML allows omitting P end-tags. 
    <p> 
    Like that and this. 
    <p> 
    And this, too. 
    <p> 
     What happened? 
    </p> 
    <p> 
     And can we 
     <p> 
     nest a paragraph, too? 
     </p> 
    </p> 
    </p> 
    </p> 
    </p> 
</body> 
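
As a rough cross-check, independent of bs4, the sketch below counts the tags that Python's standard-library tokenizer reports for the same document. It assumes Python 3, where the module is html.parser (under Python 2.7 it is called HTMLParser). The names TagCounter and count_tags are made up for this illustration. The question does not say which underlying parser bs4 picked here, so this only shows what a tokenizer without any tree-building rules sees.

```python
from html.parser import HTMLParser

SAMPLE = """<html>
<body>
<p>HTML allows omitting P end-tags.

<p>Like that and this.

<p>And this, too.

<p>What happened?</p>

<p>And can we <p>nest a paragraph, too?</p></p>

</body>
</html>
"""


class TagCounter(HTMLParser):
    """Count start and end tags exactly as they appear in the markup."""

    def __init__(self):
        super().__init__()
        self.starts = {}
        self.ends = {}

    def handle_starttag(self, tag, attrs):
        self.starts[tag] = self.starts.get(tag, 0) + 1

    def handle_endtag(self, tag):
        self.ends[tag] = self.ends.get(tag, 0) + 1


def count_tags(markup):
    """Hypothetical helper: return (start_counts, end_counts) for markup."""
    counter = TagCounter()
    counter.feed(markup)
    counter.close()
    return counter.starts, counter.ends


starts, ends = count_tags(SAMPLE)
print(starts["p"], ends["p"])  # 6 3
```

The tokenizer reports six P start tags and only three P end tags, with no implied end tags added. Any closing of the open paragraphs therefore has to come from whatever builds the tree on top of these events.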

The fault is not in prettify(), because traversing the tree manually gives me the same structure:

<[document]> 
    <html> 
     ␊ 
     <body> 
      ␊ 
      <p> 
       HTML allows omitting P end-tags.␊␊ 
       <p> 
        Like that and this.␊␊ 
        <p> 
         And this, too.␊␊ 
         <p> 
          What happened? 
         </p> 
         ␊ 
         <p> 
          And can we 
          <p> 
           nest a paragraph, too? 
          </p> 
         </p> 
         ␊ 
        </p> 
       </p> 
      </p> 
     </body> 
     ␊ 
    </html> 
    ␊ 
</[document]> 

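Now, this would be the correct result for XML (at least up to </body>, at which point it should report a well-formedness error). But this is not XML. What gives?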
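
For comparison, here is a minimal sketch of the structure one would expect if the omitted P end tags were honoured. It is a toy tree builder on top of the Python 3 standard-library html.parser, not bs4's code and not a full HTML parser: the class ImpliedPBuilder, the build helper, the dict-based node layout and the shortened MARKUP string are all assumptions made for this illustration. The only rule it implements is that a P start tag closes a P that is still open.

```python
from html.parser import HTMLParser

# Shortened stand-in for the sample document (an assumption for this sketch).
MARKUP = "<body><p>One<p>Two<p>Three</p><p>Four <p>Five</p></p></body>"


class ImpliedPBuilder(HTMLParser):
    """Toy tree builder: a <p> start tag closes any <p> that is still open."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "[document]", "children": []}
        self.stack = [self.root]

    def _open_tags(self):
        return [node["tag"] for node in self.stack]

    def _close(self, tag):
        # Pop up to and including the nearest open element named `tag`.
        while len(self.stack) > 1:
            if self.stack.pop()["tag"] == tag:
                break

    def handle_starttag(self, tag, attrs):
        if tag == "p" and "p" in self._open_tags():
            self._close("p")  # the implied </p>
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Simplification: an end tag with no matching open element is ignored.
        if tag in self._open_tags():
            self._close(tag)

    def handle_data(self, data):
        # Whitespace-only text nodes are dropped to keep the output short.
        if data.strip():
            self.stack[-1]["children"].append(data.strip())


def build(markup):
    """Hypothetical helper: parse markup and return the root node."""
    builder = ImpliedPBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


body = build(MARKUP)["children"][0]
print([child["children"] for child in body["children"]])
# [['One'], ['Two'], ['Three'], ['Four'], ['Five']]
```

With that single rule, all five paragraphs end up as siblings under body instead of nesting inside one another. The stray final </p> is simply ignored here; the HTML specification defines its own handling for that case, which this sketch does not attempt to reproduce.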

Answers