How to handle UTF-8 encoded strings and BeautifulSoup?

How can I replace HTML entities in unicode strings with the proper Unicode characters?

u'&quot;HAUS Kleider&quot; - &Uuml;ber das Bekleiden und Entkleiden, das Verh&Yuml;llen und Veredeln'

to

u'"HAUS-Kleider" - Über das Bekleiden und Entkleiden, das Verhüllen und Veredeln'

Edit
Actually, the entities are wrong. It looks as if BeautifulSoup f...ed them up.

So the question is: how do I handle UTF-8 encoded strings with BeautifulSoup?

from BeautifulSoup import BeautifulSoup

f = open('path_to_file', 'r')
lines = [i for i in f.readlines()]
soup = BeautifulSoup(''.join(lines))
allArticles = []
for row in rows:
    l = []
    for r in row.findAll('td'):
        l += [r.string]  # here things seem to go wrong
    allArticles += [l]

Ü comes out as &Yuml; rather than Ü, but really I don't want the encoding to be changed at all.

>>> soup.originalEncoding 
'utf-8'

but I can't get a correct Unicode string out of it.

Source

2010-10-29 vikingosegundo

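Possible duplicate of [Decode HTML entities in Python string?](http://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string) – geoffspear 2010-10-29 18:02:15

What exactly seems to go wrong? Did BeautifulSoup cause it? Are the entities wrong? Please try to give more precise details so that this question can be answered. BeautifulSoup tends to handle UTF-8 quite well. – 2010-10-29 18:20:23

OK, the problem was a stupid one, I have to admit. I was working in the interactive interpreter with a stale version of rows. I don't know what was wrong with its contents, but this is the correct code: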

from BeautifulSoup import BeautifulSoup

f = open('path_to_file', 'r')
lines = [i for i in f.readlines()]
soup = BeautifulSoup(''.join(lines))
rows = soup.findAll('tr')  # build rows from the current soup, not a stale one
allArticles = []
for row in rows:
    l = []
    for r in row.findAll('td'):
        l += [r.string]  # text of each table cell
    allArticles += [l]

Shame on me!
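
For anyone reading this on Python 3, here is a rough sketch of the same row and cell extraction using only the standard library. It is a swapped-in technique, not the BeautifulSoup code above: html.parser stands in for BeautifulSoup, and CellCollector, all_articles and the sample markup are made up for illustration. The point it shows is that decoding the UTF-8 bytes to a string before parsing keeps Ü intact, and that named entities are turned into characters along the way.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td>, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &Uuml; into characters
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append([])          # start a new row
        elif tag == 'td' and self.rows:
            self.rows[-1].append('')      # start a new cell in the current row
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data     # append text to the current cell


# Assumed sample input: UTF-8 bytes, as they would come from a file opened in 'rb' mode
raw = '<table><tr><td>&Uuml;ber</td><td>Verhüllen</td></tr></table>'.encode('utf-8')

parser = CellCollector()
parser.feed(raw.decode('utf-8'))  # decode bytes to str before parsing
parser.close()
all_articles = parser.rows
print(all_articles)
```

With a real file, the raw bytes would come from open('path_to_file', 'rb').read(), or the file could be opened in text mode with encoding='utf-8' so that no manual decode is needed.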

Source

2010-10-29 19:24:22 vikingosegundo

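I think what you need are the ICU transliterators. I believe there is a way to transliterate HTML entities into Unicode.

Try the transliterator ID Hex/XML-Any, which should do what you want. On the demo page you can choose "Insert Sample: Compound", enter Hex/XML-Any in the "Compound 1" box, add some input data to the input box and press "Transform". Does this help?

There is also a Python binding for ICU, but I don't think it is well maintained.

Source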

2010-10-29 18:08:36 towi

htmlentitydefs.entitydefs["quot"] returns '"'.
That is a dictionary that maps entity names to their actual characters.
You should be able to carry on easily from there.
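
To make the dictionary idea concrete, here is a minimal sketch, assuming Python 3, where the htmlentitydefs module is named html.entities. The function name replace_named_entities and the regular expression are illustrative choices, not part of any library; name2codepoint is the companion table in the same module that maps entity names to code points.

```python
import re
from html.entities import name2codepoint  # Python 3 name for htmlentitydefs

# Matches named entities such as &quot; or &Uuml;
_ENTITY = re.compile(r'&(\w+);')


def replace_named_entities(text):
    """Replace named HTML entities with the characters they stand for."""
    def _sub(match):
        name = match.group(1)
        if name in name2codepoint:
            return chr(name2codepoint[name])
        return match.group(0)  # leave unknown entities untouched
    return _ENTITY.sub(_sub, text)


sample = '&quot;HAUS Kleider&quot; - &Uuml;ber das Bekleiden'
print(replace_named_entities(sample))
```

On Python 3.4 and later, html.unescape from the standard library covers the same ground and also handles numeric references, so the hand-written version is mainly useful for seeing how the lookup works. Note that it faithfully converts a wrong entity too: &Yuml; becomes Ÿ, so it cannot repair input that already contains the wrong entity.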

Source

2010-10-29 18:24:01 BlueTrance

If only BeautifulSoup would give me the right entities. See my edit. – vikingosegundo 2010-10-29 18:25:05
