Python: I use .decode() - 'ascii' codec can't encode

It seems I was using the wrong function. With .fromstring there is no error message.

xml_ = load() # here comes the unicode string with Cyrillic letters 

print xml_ # prints everything fine 

print type(xml_) # 'lxml.etree._ElementUnicodeResult' = unicode 

xml = xml_.decode('utf-8') # here is an error 

doc = lxml.etree.parse(xml) # if I do not decode it - the same error appears here 

File "testLog.py", line 48, in <module> 
    xml = xml_.decode('utf-8') 
    File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode 
    return codecs.utf_8_decode(input, errors, True) 
UnicodeEncodeError: 'ascii' codec can't encode characters in position 89-96: ordinal not in range(128)

If

xml = xml_.encode('utf-8') 

doc = lxml.etree.parse(xml) # here's an error

or

xml = xml_

then

UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 89: ordinal not in range(128)

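If I understand correctly: I must decode a non-ASCII string into the internal representation, work with that representation, and encode it before sending it to output? It seems that is exactly what I am doing.

Because of the 'Accept-Charset': 'utf-8' header, the input data must be in UTF-8.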
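The decode, process, encode workflow described above can be sketched as follows. This is a minimal illustration, not code from the question: the sample bytes and the variable names are assumptions, and it is written so that it behaves the same on Python 3, where `str` is already text.

```python
# -*- coding: utf-8 -*-
# Hypothetical input: UTF-8 bytes as they might arrive from a network response.
raw = u"Привет".encode("utf-8")

text = raw.decode("utf-8")      # bytes -> text, once, at the input boundary
shouted = text.upper()          # all processing happens on decoded text
out = shouted.encode("utf-8")   # text -> bytes, once, at the output boundary

print(type(text).__name__, type(out).__name__)
```

The point of the pattern is that `.decode()` is only ever called on bytes and `.encode()` only on text. Calling `.decode()` on a value that is already text is what goes wrong in the question.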


2012-07-08 Ben Usman

Is the error still about character encoding on the etree.parse() call? What is the type of xml? etree.parse does not work on str or unicode objects. Try etree.fromstring() instead. – hasanyasin 2012-07-08 18:06:18

@hasanyasin, it looks like you are right. :) – 2012-07-08 18:08:24

I will write a proper answer covering both issues, which I hope you will accept as the correct answer. :) – hasanyasin 2012-07-08 18:09:19

For me, using the .fromstring() method was what was needed.


2014-03-18 20:15:01

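If your original string is unicode, it only makes sense to encode it to utf-8, not to decode it from utf-8.

I think the xml parser can only handle xml that is ascii.

So use xml = xml_.encode('ascii','xmlcharrefreplace') to convert the unicode characters that are not in ascii into xml entities.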
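To make the effect of 'xmlcharrefreplace' concrete, here is a small sketch. The sample element is a made-up example, not data from the question. Every character outside ASCII is replaced by a numeric character reference, so the result is pure ASCII bytes.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text containing one Cyrillic letter (U+0416).
text = u"<name>Ж</name>"

# Characters that ASCII cannot represent become &#NNNN; references,
# where NNNN is the decimal code point (0x0416 == 1046).
ascii_xml = text.encode("ascii", "xmlcharrefreplace")

print(ascii_xml)
```

An XML parser resolves such references back to the original characters, so no text is lost.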


2012-07-08 17:56:53

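Then the same error appears one line lower. – 2012-07-08 17:58:23

I see now. Please take a look at the edited question. – 2012-07-08 18:05:30

@hasanyasin: I am encoding a unicode string into bytes in the ascii encoding. That is quite possible. Cyrillic characters get translated into xml entities. For example 'Ж' becomes '&#1046;'. – 2012-07-08 18:32:43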

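The lxml library has already put things into the unicode type for you. You are running into Python 2's automatic unicode/bytes conversion. The hint is that you asked it to decode, but you got an encode error. It tries to convert your string to the default byte encoding (ASCII) and then decode that back to unicode.

Use the .encode method on unicode objects to convert them to bytes (the str type).

Watching this will teach you a lot about how to solve this kind of problem: http://nedbatchelder.com/text/unipain.html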
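The implicit conversion described in this answer can be spelled out explicitly. The helper below is a hypothetical illustration (the name `py2_style_decode` is not a real API): it reproduces by hand the two steps Python 2 performs when `.decode()` is called on a unicode object, so it also runs on Python 3, where that implicit step no longer exists.

```python
# -*- coding: utf-8 -*-
# Already-decoded text, standing in for the value lxml returned.
text = u"Привет"

def py2_style_decode(u, encoding):
    """Hypothetical helper: what Python 2 does for unicode.decode(encoding)."""
    as_bytes = u.encode("ascii")      # implicit step; fails on non-ASCII text
    return as_bytes.decode(encoding)  # the step that was actually requested

try:
    py2_style_decode(text, "utf-8")
    error_name = None
except UnicodeEncodeError as exc:
    # The failure comes from the hidden encode step, not from the decode.
    error_name = type(exc).__name__

print(error_name)
```

This is why a call that reads as "decode" reports an encode error: the exception is raised before any UTF-8 decoding takes place.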


2012-07-08 17:58:14 Daenyth

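I assume you are trying to parse some website?

Have you validated that the website is correct? Maybe its encoding is wrong?

Many websites are broken and rely on web browsers having very robust parsers. You could try a parser that is just as robust.

There is a de facto web standard that the "charset" HTML header (which may involve negotiation and relates to the Accept-Encoding you mentioned) is overruled by any <meta http-equiv=... tag in the HTML file!

So you may simply not have UTF-8 input!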


2012-07-08 18:06:54

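str and unicode objects have different types and different representations of their content in memory. Unicode is the decoded form of text, while str is the encoded form.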

# -*- coding: utf-8 -*-

# Now, my string literals in this source file will 
# be str objects encoded in utf-8. 

# In Python3, they will be unicode objects. 
# Below examples show the Python2 way. 

s = 'ş' 
print type(s) # prints <type 'str'> 

u = s.decode('utf-8') 
# Here, we create a unicode object from a string 
# which was encoded in utf-8. 

print type(u) # prints <type 'unicode'>

As you can see,

.encode() --> str 
.decode() --> unicode

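When we encode or decode a string, we need to make sure that our text can be represented in the source/target encoding. A string encoded in iso-8859-1 cannot be decoded correctly with iso-8859-9.

As for the second error reported in the question, lxml.etree.parse() works on file-like objects. To parse from a string, lxml.etree.fromstring() should be used.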
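A short sketch of such a mismatch, using the same letter as the example above. The variable names are illustrative only. The two encodings share most byte values but differ in a handful of positions, so decoding with the wrong one raises no error and silently yields a different character.

```python
# -*- coding: utf-8 -*-
s = u"ş"

latin5 = s.encode("iso-8859-9")      # one byte in the Turkish code page
wrong = latin5.decode("iso-8859-1")  # same byte read as Western European
right = latin5.decode("iso-8859-9")  # round trip with the matching codec

# The letter is not part of iso-8859-1 at all, so encoding to it fails.
try:
    s.encode("iso-8859-1")
    covered = True
except UnicodeEncodeError:
    covered = False

print(repr(latin5), repr(wrong), right == s, covered)
```

Decoding with the wrong codec is the more dangerous case, because the corrupted text passes through without any exception.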
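The difference between the two entry points can be sketched as follows. As an assumption made only to keep the example self-contained, it uses the standard library's `xml.etree.ElementTree` as a stand-in for `lxml.etree`; it offers `fromstring()` and `parse()` with the same division of labour. The sample document is invented, and the sketch targets Python 3.

```python
# -*- coding: utf-8 -*-
import io
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch

# Hypothetical document held in memory as text.
xml_text = u"<root><name>Привет</name></root>"

# fromstring() takes the markup itself and returns the root element.
root = ET.fromstring(xml_text)

# parse() takes a file name or file-like object, so in-memory data
# has to be wrapped first; it returns a tree, not an element.
tree = ET.parse(io.BytesIO(xml_text.encode("utf-8")))

print(root.find("name").text == tree.getroot().find("name").text)
```

Passing the markup itself to `parse()` makes the library treat it as a file name, which is why the question's call failed regardless of how the string was encoded.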


2012-07-08 18:21:17 hasanyasin
