Python 2.7 - Finding UTF-8 characters

from urllib import urlopen 
web = urlopen("http://typographyforlawyers.com/straight-and-curly-quotes.html").read()
web = web.replace("\xe2\x80\x9c".decode('utf8'), '"')

"\xe2\x80\x9c" is the UTF-8 encoding of a curly quote (the left double quotation mark, U+201C). When I try to find curly quotes on a website with this code, I get this error:

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    web = web.replace("\xe2\x80\x9c".decode('utf8'), '"')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2265: ordinal not in range(128)

What does this error mean, what am I doing wrong, and how do I fix it?

2017-05-08 Dman42

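You have to decode the downloaded string with decode('utf-8') before calling replace, so that both the page and the search string are unicode objects.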

from urllib import urlopen 

web = urlopen("http://typographyforlawyers.com/straight-and-curly-quotes.html").read().decode('utf-8') 
web = web.replace(b"\xe2\x80\x9c".decode('utf8'), '"') 

print(web)
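
The same idea can be tried without a network connection. The following is a minimal sketch, assuming the page is UTF-8; the sample byte string and the helper name `straighten_quotes` are made up for illustration and are not part of the original answer. It uses explicit `b"..."` and `u"..."` literals so that it should behave the same on Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
# Hypothetical sample standing in for urlopen(...).read(): UTF-8 bytes
# containing a left (U+201C) and a right (U+201D) curly double quote.
raw_page = b"He said \xe2\x80\x9chello\xe2\x80\x9d."


def straighten_quotes(raw, encoding="utf-8"):
    """Decode raw bytes, then replace curly double quotes with straight ones."""
    text = raw.decode(encoding)            # bytes -> unicode text
    text = text.replace(u"\u201c", u'"')   # left curly quote
    text = text.replace(u"\u201d", u'"')   # right curly quote
    return text


result = straighten_quotes(raw_page)
print(result)
```

Because both the page and the search strings are unicode by the time `replace` runs, the ascii codec is never involved.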

2017-05-08 01:56:27

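I like this answer, but the explanation could be more explicit. It converts the web response to unicode and decodes the search string from a `bytes` object, so there is no reason for the ascii codec to get involved. It should also be mentioned that HTML documents often declare their encoding in a `<meta>` tag, and 'utf-8' may not be the right guess. It usually is, but that is not guaranteed. – tdelaney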
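
One rough way to act on that caveat is to look for a declared charset before decoding. This is only a sketch under stated assumptions: `guess_encoding` is a hypothetical helper, it inspects just the first 1024 bytes for a `<meta ... charset=...>` declaration, and it ignores the HTTP `Content-Type` header and the rest of the HTML encoding-sniffing rules:

```python
import re

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">
_META_CHARSET = re.compile(
    br'<meta[^>]*charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def guess_encoding(raw, default="utf-8"):
    """Return the charset named in a <meta> tag, or `default` if none is found."""
    match = _META_CHARSET.search(raw[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return default


# Made-up sample page declaring ISO-8859-1; 0xe9 is "é" in that encoding.
sample = b'<html><head><meta charset="ISO-8859-1"></head><p>caf\xe9</p>'
encoding = guess_encoding(sample)
text = sample.decode(encoding)
print(encoding)
```

If no declaration is found, the helper falls back to `default`, which is the same 'utf-8' guess the answer makes.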

Thank you very much for your answer, that makes sense too. – Dman42

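This is due to Python 2 using the "ascii" codec as the default when it interprets string literals. In the future (Python 3), the default is utf-8 and you can use Unicode characters directly in your code. You can have this now in Python 2 by using a future import.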

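# -*- coding: utf-8 -*-
# The coding line above is needed in Python 2 when the source file itself contains non-ASCII characters.
from __future__ import unicode_literals

from urllib import urlopen

web = urlopen("http://typographyforlawyers.com/straight-and-curly-quotes.html").read()
web = web.decode("utf-8")
web = web.replace('“', '"')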

print(repr(web))

2017-05-08 02:18:39 Keith

'unicode_literals' has many potential pitfalls. 'web.decode("utf-8")' is what solved the problem. The rest is risky. – tdelaney

I'm not clear on what the OP actually needs. 'unicode_literals' can have pitfalls, but not if you use it correctly. For a small script like this, it will be fine. – Keith

Note that this is a Python 2 solution. Python 3 handles strings and bytes differently.

I can reproduce the problem with

>>> web = "0123\xe2\x80\x9c789" 
>>> web.replace("\xe2\x80\x9c".decode('utf-8'), '"') 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 4: ordinal not in range(128)

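You read an encoded string into web; I just made a simple test string. When you decoded the search string, you created a unicode object. For the replace to work, web needs to be converted to unicode as well.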

>>> "\xe2\x80\x9c".decode('utf-8') 
u'\u201c' 
>>> unicode(web) 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 4: ordinal not in range(128)
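
The implicit conversion can also be written out by hand. The sketch below reuses the test string from above and performs explicitly the ASCII decode that Python 2 attempts on its own when a str meets a unicode object; the `b"..."` and `u"..."` literals are there so it should also run on Python 3:

```python
web = b"0123\xe2\x80\x9c789"   # encoded bytes, as read from the network

# Python 2 tries the equivalent of this decode implicitly before it
# mixes a str with a unicode search string. It fails on the first
# non-ASCII byte (0xe2 at position 4).
try:
    web.decode("ascii")
    implicit_error = None
except UnicodeDecodeError as exc:
    implicit_error = exc

print(implicit_error)

# Decoding explicitly with the codec the data was encoded in succeeds,
# and the replace then works on unicode text.
text = web.decode("utf-8")
replaced = text.replace(u"\u201c", u'"')
print(repr(replaced))
```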

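It is that conversion of web that got you. In Python 2, str can hold encoded bytes, which is exactly what you have here. One option is to just replace the encoded byte sequence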

>>> web.replace("\xe2\x80\x9c", '"') 
'0123"789'

This only works because you know the page is encoded in UTF-8. That is usually the case, but it is worth mentioning.
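
A slightly fuller sketch of the byte-level approach, again assuming UTF-8 input: rather than typing the escape sequences by hand, the byte patterns can be derived from the characters themselves. The names `LEFT_QUOTE` and `RIGHT_QUOTE` and the sample string are illustrative only:

```python
# Derive the UTF-8 byte sequences from the code points.
LEFT_QUOTE = u"\u201c".encode("utf-8")    # left curly double quote
RIGHT_QUOTE = u"\u201d".encode("utf-8")   # right curly double quote

web = b"0123\xe2\x80\x9c789\xe2\x80\x9d"

# Everything stays bytes here: the data, the patterns and the replacement.
fixed = web.replace(LEFT_QUOTE, b'"').replace(RIGHT_QUOTE, b'"')
print(repr(fixed))
```

Keeping every operand as bytes matters even more on Python 3, where passing a str to `bytes.replace` raises a TypeError instead of attempting an implicit conversion.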

2017-05-08 02:48:58 tdelaney
