IMDb review encoding error

I am trying to build an RNN model that classifies reviews as having positive or negative sentiment.

I have a vocabulary of words, and during preprocessing I convert each review into a sequence of indices.
For example,

"This movie was best" --> [2,5,10,3]

When I take the most frequent vocabulary entries and try to print them, I get this error:

num of reviews 100 
number of unique tokens : 4761 
Traceback (most recent call last): 
    File "preprocess.py", line 47, in <module> 
    print(vocab) 
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 10561: ordinal not in range(128)

The code looks like this:

import itertools
import os

import nltk
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize

# vocab_size and unknown_token are defined elsewhere in the script (not shown).
reviews = []
for item in os.listdir('imdbdata/train/pos')[:100]:
    with open("imdbdata/train/pos/" + item, 'r', encoding='utf-8') as f:
        # Strip HTML markup from the raw review text.
        sample = BeautifulSoup(f.read()).get_text()
    sample = word_tokenize(sample.lower())
    reviews.append(sample)
print("num of reviews", len(reviews))
word_freq = nltk.FreqDist(itertools.chain(*reviews))
print("number of unique tokens : %d" % (len(word_freq.items())))
vocab = word_freq.most_common(vocab_size - 1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict((w, i) for i, w in enumerate(index_to_word))
print(vocab)  # this is the line that raises UnicodeEncodeError
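
To make the index encoding from the example concrete, here is a minimal sketch of how a word_to_index mapping could turn a review into a sequence of indices. The toy vocabulary, the encode_review helper, and the choice of 0 as the unknown-word index are assumptions for illustration only, and plain whitespace splitting stands in for word_tokenize.

```python
# Hypothetical toy vocabulary; indices chosen to match the example in the question.
word_to_index = {"this": 2, "movie": 5, "was": 10, "best": 3}
UNKNOWN_INDEX = 0  # assumed index for out-of-vocabulary words


def encode_review(text, word_to_index, unknown_index=UNKNOWN_INDEX):
    """Lowercase the text, split on whitespace, and map each token to its index."""
    return [word_to_index.get(tok, unknown_index) for tok in text.lower().split()]


encoded = encode_review("This movie was best", word_to_index)
print(encoded)  # [2, 5, 10, 3]
```

Words missing from the vocabulary fall back to the unknown index, which is the role unknown_token appears to play in the code above.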

My question is: when working on NLP problems in Python, how can I get rid of this UnicodeEncodeError, especially when reading text with the open function?

2017-10-09 Peter Kim

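It looks like your terminal is configured for ASCII. Because the character '\xe9' lies outside the ASCII range (0x00-0x7F), it cannot be printed on an ASCII terminal. For the same reason it cannot be encoded as ASCII: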

>>> s = '\xe9' 
>>> s.encode('ascii') 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 0: ordinal not in range(128)

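You can work around this by explicitly encoding the string when printing it and handling encoding errors by replacing unsupported characters with ?: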

>>> print(s.encode('ascii', errors='replace')) 
b'?'
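
Printing the bytes object shows the b'...' wrapper, which may not be what you want for a whole vocabulary. As a rough sketch, the same idea can be wrapped in a small helper that encodes with an error handler and decodes back to str before printing. The name to_printable is hypothetical, and the fallback to "ascii" when stdout reports no encoding is an assumption.

```python
import sys


def to_printable(text, encoding=None, errors="backslashreplace"):
    """Return text with characters the target encoding cannot represent escaped or replaced."""
    # Fall back to stdout's encoding, or ASCII if stdout reports none (assumption).
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors=errors).decode(encoding)


escaped = to_printable("caf\xe9", encoding="ascii")
replaced = to_printable("caf\xe9", encoding="ascii", errors="replace")
print(escaped)   # caf\xe9
print(replaced)  # caf?
```

With errors="backslashreplace" the original character can still be identified from the output, whereas errors="replace" discards it.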

The character appears to be é (lowercase e with an acute accent); its code point U+00E9 matches its ISO-8859-1 byte value 0xE9.

You can check which encoding is used for standard output. In my case it is UTF-8, and I have no problem printing that character:

>>> import sys 
>>> sys.stdout.encoding 
'UTF-8' 
>>> print('\xe9') 
é
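
If you want to check this from inside a script rather than at the prompt, something along these lines could report whether a given string is representable in a given encoding. The can_encode helper is a hypothetical name, not part of any library.

```python
import sys


def can_encode(text, encoding):
    """Return True if every character in text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print("stdout encoding:", sys.stdout.encoding)
for enc in ("ascii", "latin-1", "utf-8"):
    print(enc, can_encode("\xe9", enc))
```

For '\xe9' this reports False for ascii and True for latin-1 and utf-8, which is consistent with the traceback in the question.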

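You may be able to force Python to use a different default encoding; this has been discussed elsewhere, but the best approach is to use a terminal that supports UTF-8.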
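
If changing the terminal is not an option, one possible way to force the output encoding is to wrap a binary stream in a text layer that always encodes as UTF-8. The sketch below demonstrates this on an in-memory buffer so that it runs anywhere; for real output you would presumably pass sys.stdout.buffer instead, assuming your stdout exposes a .buffer attribute. The utf8_writer name is hypothetical.

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream in a text layer that always encodes as UTF-8."""
    # newline="\n" disables platform-specific newline translation on write.
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")


# Demonstration on an in-memory buffer instead of the real stdout.
buf = io.BytesIO()
out = utf8_writer(buf)
print("\xe9", file=out)
out.flush()
data = buf.getvalue()
print(data)  # b'\xc3\xa9\n'
```

Setting the environment variable PYTHONIOENCODING=utf-8 before starting Python, or calling sys.stdout.reconfigure(encoding="utf-8") on Python 3.7 and later, should have a similar effect. In all of these cases a terminal that does not understand UTF-8 will still display the bytes incorrectly.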

2017-10-09 10:41:17 mhawke

This is the answer I was looking for! Thanks.
