HTML element in lxml gets the wrong encoding, like Н а й

I need to print an RSS link from a web page, but the link is decoded incorrectly. Here is my code:

import urllib2 
from lxml import html, etree 
import chardet 

data = urllib2.urlopen('http://facts-and-joy.ru/') 
S=data.read() 
encoding = chardet.detect(S)['encoding'] 
#S=S.decode(encoding) 
#encoding='utf-8' 

print encoding 
parser = html.HTMLParser(encoding=encoding) 
content = html.document_fromstring(S,parser) 
loLinks = content.xpath('//link[@type="application/rss+xml"]') 

for oLink in loLinks: 
    print oLink.xpath('@title')[0] 
    print etree.tostring(oLink,encoding='utf-8')

Here is my output:

utf-8 
Позитивное мышление RSS Feed 
<link rel="alternate" type="application/rss+xml" title="&#x41F;&#x43E;&#x437;&#x438;&#x442;&#x438;&#x432;&#x43D;&#x43E;&#x435; &#x43C;&#x44B;&#x448;&#x43B;&#x435;&#x43D;&#x438;&#x435; RSS Feed" href="http://facts-and-joy.ru/feed/" />&#13;

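The title content is displayed correctly on its own, but in the tostring() output it is replaced by strange &#... sequences. How can I print the whole link element correctly?

Thanks in advance for your help!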


2013-10-03 Apogentus

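Not sure why you think this is incorrect. As far as I can tell, these are the UTF-8 encodings of the Russian text. What do you expect it to give you? –

I guess the OP expects actual Russian characters and not character references: П instead of '&#x41F;', etc. – mzjn

Yes, I want to see the actual Russian characters. – Apogentus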

Here is a simplified version of your program that works:

from lxml import html 

url = 'http://facts-and-joy.ru/' 
content = html.parse(url) 
rsslinks = content.xpath('//link[@type="application/rss+xml"]') 

for link in rsslinks: 
    print link.get('title') 
    print html.tostring(link, encoding="utf-8")

Output:

Позитивное мышление RSS Feed 
<link rel="alternate" type="application/rss+xml" title="Позитивное мышление RSS Feed" href="http://facts-and-joy.ru/feed/">&#13;

The key line is

print html.tostring(link, encoding="utf-8")

This is the only thing you have to change in your original program.

Using html.tostring() instead of etree.tostring() produces actual characters instead of numeric character references. You can also use etree.tostring(link, method="html", encoding="utf-8").
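To try both calls without fetching the page, here is a minimal sketch written for Python 3 (the code above is Python 2). The inline SAMPLE_PAGE string is an assumption standing in for the real page's head section, and the variable names are only illustrative.

```python
from lxml import etree, html

# Assumed stand-in for the real page, so no network access is needed.
SAMPLE_PAGE = (
    '<html><head>'
    '<link rel="alternate" type="application/rss+xml" '
    'title="Позитивное мышление RSS Feed" '
    'href="http://facts-and-joy.ru/feed/">'
    '</head><body></body></html>'
)

doc = html.document_fromstring(SAMPLE_PAGE)
link = doc.xpath('//link[@type="application/rss+xml"]')[0]

# Both calls go through the HTML serializer.
via_html_module = html.tostring(link, encoding="utf-8")
via_etree_html = etree.tostring(link, method="html", encoding="utf-8")

# With encoding="utf-8", tostring() returns bytes in Python 3,
# so decode before printing.
print(via_html_module.decode("utf-8"))
print(via_etree_html.decode("utf-8"))
```

In Python 3, passing encoding="unicode" instead of "utf-8" makes tostring() return a str directly, which avoids the explicit decode step.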

It is not clear why there is a difference between the "html" and "xml" output methods. This post to the lxml mailing list did not get any replies: https://mailman-mail5.webfaction.com/pipermail/lxml/2011-September/006131.html.
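Whichever output method is used, the numeric character references and the literal characters denote the same text, so a consumer that parses the markup sees the same title either way. A small Python 3 check, using only the standard library's html module (not lxml.html) and the escaped title copied from the question's output:

```python
import html  # standard library module, unrelated to lxml.html

# Escaped title attribute, copied from the etree.tostring() output above.
escaped_title = (
    "&#x41F;&#x43E;&#x437;&#x438;&#x442;&#x438;&#x432;&#x43D;&#x43E;&#x435; "
    "&#x43C;&#x44B;&#x448;&#x43B;&#x435;&#x43D;&#x438;&#x435; RSS Feed"
)

# html.unescape() resolves each &#x...; reference to its code point.
decoded_title = html.unescape(escaped_title)
print(decoded_title)  # Позитивное мышление RSS Feed

# A single reference maps to a single character: U+041F is Cyrillic "П".
print(html.unescape("&#x41F;") == chr(0x41F))  # True
```

So the escaped form is not data corruption, only a different serialization; it matters mainly when the output is meant to be read by a person.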


2013-10-03 20:54:33 mzjn
