2013-02-15 50 views

Running this code with BeautifulSoup4 and Python 3.3 raises an error:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("my.html"))
print(soup.prettify())

It produces the following error:

Traceback (most recent call last):
  File "soup.py", line 5, in <module>
    print(soup.prettify())
  File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u25ba' in position 9001: character maps to <undefined>

I then tried:

print(soup.encode('UTF-8').prettify()) 

But this fails, because encode() returns a bytes object, which has no prettify method:

Traceback (most recent call last):
  File "soup.py", line 11, in <module>
    print(soup.encode('UTF-8').prettify())
AttributeError: 'bytes' object has no attribute 'prettify'

I'm not sure how to resolve this. Any input would be appreciated.

Try decoding from the byte string first: bytes.decode(my.html) – 2013-02-15 06:22:43

I couldn't get this to work with Beautiful Soup (AttributeError: 'str' object has no attribute ...) – Jim 2013-02-15 16:32:38

Answer

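Your (Windows) console is using the cp437 encoding, and the soup contains a character that this encoding does not support. The default behavior in that case is to raise an exception, but you can change it: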

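import io
import sys
from bs4 import BeautifulSoup

# Re-wrap stdout so unencodable characters are printed as \uXXXX escapes instead of raising.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='cp437', errors='backslashreplace')
soup = BeautifulSoup(open("my.html"))
print(soup.prettify())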
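
To see what the error handler changes, here is a minimal sketch that uses only the standard library and needs no HTML file. The `sample` string is an illustrative value, not taken from the original page; it simply contains U+25BA, the character named in the traceback. The other names (`strict_failed`, `escaped`, `replaced`) are also just for illustration.

```python
# Illustrative text containing U+25BA, the character from the traceback.
sample = "play \u25ba"

# Default handler ('strict'): an unencodable character raises UnicodeEncodeError.
try:
    sample.encode("cp437")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# 'backslashreplace' substitutes a Python-style escape for the character.
escaped = sample.encode("cp437", errors="backslashreplace")

# 'replace' substitutes a question mark, so the original character is lost.
replaced = sample.encode("cp437", errors="replace")

print(strict_failed, escaped, replaced)
```

With 'backslashreplace' the output stays readable and the original code point can still be recovered from the escape, which is likely why it is a reasonable choice for console output here.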

Alternatively, write the soup to a file and open it in an editor that supports the encoding:

# On Windows, utf-8-sig will allow the file to be read by Notepad. 
with open('out.txt','w',encoding='utf-8-sig') as f: 
    f.write(soup.prettify()) 
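
As a rough illustration of what utf-8-sig does, the sketch below writes a short string and inspects the resulting bytes. The helper name `write_for_notepad`, the `sample` text, and the temporary path are all hypothetical and chosen only for this example. The codec prepends a three-byte byte order mark (BOM) that editors such as Notepad can use to detect UTF-8.

```python
import os
import tempfile


def write_for_notepad(text, path):
    """Write text as UTF-8 with a leading byte order mark (hypothetical helper)."""
    with open(path, "w", encoding="utf-8-sig") as f:
        f.write(text)


# Illustrative text containing the character that cp437 could not encode.
sample = "play \u25ba"
path = os.path.join(tempfile.mkdtemp(), "out.txt")
write_for_notepad(sample, path)

# Read the raw bytes to see the BOM that utf-8-sig added.
with open(path, "rb") as f:
    raw = f.read()

# Reading back with utf-8-sig strips the BOM again.
with open(path, encoding="utf-8-sig") as f:
    round_trip = f.read()

print(raw[:3], round_trip == sample)
```

Reading the file back with plain utf-8 instead would leave the BOM in the text as a leading '\ufeff' character, so it is usually simplest to use utf-8-sig on both sides.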

Both solutions work perfectly, thank you. – Jim 2013-02-15 16:10:42