2015-12-25

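BeautifulSoup fails to parse HTML with `html5lib`

BeautifulSoup cannot parse an HTML page when I use the html5lib option, but it works fine with the html.parser option. According to the docs, html5lib should be more lenient than html.parser, so why do I get garbled text when I parse the page with it?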

Below is a small runnable example. (If you change html5lib to html.parser, the Chinese output comes out correctly.)

#_*_coding:utf-8_*_ 
import requests 
from bs4 import BeautifulSoup 

ss = requests.Session() 
res = ss.get("http://tech.qq.com/a/20151225/050487.htm") 
html = res.content.decode("GBK").encode("utf-8") 
soup = BeautifulSoup(html, 'html5lib') 
print str(soup)[0:800] # where you can see if the html is parsed normally or not 

Answer

Don't re-encode your content. Leave the decoding to BeautifulSoup:

soup = BeautifulSoup(res.content, 'html5lib') 

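If you do re-encode, you need to replace the meta header that is present in the source, because it still declares the old charset: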

<meta http-equiv="Content-Type" content="text/html; charset=gb2312"> 
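If re-encoding really is needed, the declaration can be rewritten in the same step. This is only a minimal sketch in Python 3 using the standard library: the helper name `reencode_as_utf8` and the sample page are made up for illustration, and the regex only covers a simple `charset=...` declaration like the one above.

```python
import re

# Hypothetical sample standing in for the downloaded page (GBK bytes, like res.content).
SAMPLE = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=gb2312">'
    '</head><body><p>中文内容</p></body></html>'
).encode("gbk")


def reencode_as_utf8(raw, source_encoding="gbk"):
    """Decode the raw bytes, point the meta charset at UTF-8, and re-encode."""
    text = raw.decode(source_encoding)
    # Rewrite the first declared charset so it matches the new byte encoding.
    text = re.sub(
        r'(charset\s*=\s*["\']?)[\w-]+',
        r'\g<1>utf-8',
        text,
        count=1,
        flags=re.IGNORECASE,
    )
    return text.encode("utf-8")


fixed = reencode_as_utf8(SAMPLE)
```

The resulting bytes and their declaration now agree, so they could be handed to `BeautifulSoup(fixed, 'html5lib')` in place of the `html` variable from the question.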

Or decode manually and pass in Unicode:

soup = BeautifulSoup(res.content.decode('gbk'), 'html5lib')
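The garbling itself can be reproduced without any parser: once the bytes are UTF-8, anything that trusts the `gb2312` declaration decodes them with the wrong codec. A small Python 3 sketch with a made-up sample string follows. That html5lib follows the meta declaration in this case is inferred from the answer above, not checked against its source.

```python
text = "腾讯科技"                    # hypothetical sample text
gbk_bytes = text.encode("gbk")        # what the server sends
utf8_bytes = text.encode("utf-8")     # what the question's re-encoding produces

# Decoding with the codec the page declares works for the original bytes...
ok = gbk_bytes.decode("gbk")
# ...but the re-encoded bytes, read under the declared charset, come out garbled.
garbled = utf8_bytes.decode("gbk", errors="replace")

print(ok == text)       # True
print(garbled == text)  # False
```

Passing already-decoded Unicode, as in the line above, sidesteps the problem because the parser no longer has to guess or trust any charset declaration.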