Parsing a broken HTML page in Python

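I want to parse a broken HTML page that has comments in it, and all the well-known HTML parsers such as BeautifulSoup, lxml, and HTMLParser give a syntax error. The code is below. How can I ignore the broken part and parse the rest of the page?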

<html xmlns="http://www.w3.org/1999/xhtml"><head> 

<script language="JavaScript"> 
<!-- 
    function setTimeOffsetVars (Link) { 
    // code removed 
} 

<!-- Image Preloader - takes an array of images to preload --> 
    function warningCheck(e, warnMsg) { 
    // code removed 
} 
--> 
</script> 

</head> 

<body topmargin="0" leftmargin="0" rightmargin="0" bottommargin="0" marginwidth="0" marginheight="0"> 
<!-- lot of useful code --> 
</body></html>

Source

2012-12-26 raju

I don't get any errors with this HTML. I tried both beautifulsoup4 and lxml.

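from bs4 import BeautifulSoup

# s is the HTML string from the question.
# "lxml" is the parser bs4 picks by default when lxml is installed.
soup = BeautifulSoup(s, "lxml")
print(soup.prettify())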


<html xmlns="http://www.w3.org/1999/xhtml"> 
<head> 
    <script language="JavaScript"> 
    &lt;!-- 
    function setTimeOffsetVars (Link) { 
    // code removed 
} 

&lt;!-- Image Preloader - takes an array of images to preload --&gt; 
    function warningCheck(e, warnMsg) { 
    // code removed 
} 
--&gt; 
    </script> 
</head> 
<body bottommargin="0" leftmargin="0" marginheight="0" marginwidth="0" rightmargin="0" topmargin="0"> 
    <!-- lot of useful code --> 
</body> 
</html>

Source

2012-12-26 09:00:59 sneawo

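If you know what the problem is, you can preprocess: first strip the problematic inner comments with something primitive like a regular expression, then hit the result with a real parser.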
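A minimal sketch of that preprocessing idea, assuming the trouble comes from the comment markers inside the <script> block. The names strip_script_comments and CommentCollector are hypothetical, and the standard-library HTMLParser stands in for whichever real parser you prefer. The sample page is a shortened version of the one in the question.

```python
import re
from html.parser import HTMLParser

# Matches each <script>...</script> block; group 2 is the script body.
SCRIPT_RE = re.compile(
    r"(<script\b[^>]*>)(.*?)(</script\s*>)", re.IGNORECASE | re.DOTALL
)
# A one-line comment, or a stray opening/closing comment marker.
MARKER_RE = re.compile(r"<!--[^\n]*?-->|<!--|-->")


def strip_script_comments(html):
    """Remove HTML comment markers that appear inside <script> blocks."""
    def clean(match):
        return match.group(1) + MARKER_RE.sub("", match.group(2)) + match.group(3)
    return SCRIPT_RE.sub(clean, html)


class CommentCollector(HTMLParser):
    """Stand-in for a real parser: records start tags and comments."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_comment(self, data):
        self.comments.append(data.strip())


SAMPLE = """<html><head>
<script language="JavaScript">
<!--
    function setTimeOffsetVars (Link) { }
<!-- Image Preloader - takes an array of images to preload -->
    function warningCheck(e, warnMsg) { }
-->
</script>
</head>
<body>
<!-- lot of useful code -->
</body></html>"""

cleaned = strip_script_comments(SAMPLE)
collector = CommentCollector()
collector.feed(cleaned)
collector.close()
print(collector.tags)
print(collector.comments)
```

Only the script body is touched, so comments elsewhere in the page (such as the one in <body>) survive for the parser to see. The regex is deliberately crude and would need adjusting for pages where the breakage is somewhere else.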

Source

2012-12-26 08:24:05 Amadan
