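IMDb review encoding error
I am trying to build an RNN model that classifies reviews as positive or negative sentiment.
There is a vocabulary of words, and during preprocessing I convert each review into a sequence of indices.
For example,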
"This movie was best" --> [2,5,10,3]
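As a rough illustration of that mapping (not the actual preprocessing script), here is a minimal sketch. The toy vocabulary, the `encode` helper, and the unknown-word index of 0 are all assumptions made up for this example; the indices are simply chosen to match the sequence above.

```python
# Hypothetical toy vocabulary; indices are chosen to match the example above.
word_to_index = {"this": 2, "movie": 5, "was": 10, "best": 3}
UNKNOWN_INDEX = 0  # assumed index for out-of-vocabulary words


def encode(sentence, word_to_index, unknown_index=UNKNOWN_INDEX):
    """Lower-case the sentence, split on whitespace, and map each word to its index."""
    return [word_to_index.get(word, unknown_index) for word in sentence.lower().split()]


print(encode("This movie was best", word_to_index))  # [2, 5, 10, 3]
```

The real script tokenizes with `word_tokenize` rather than a whitespace split, so punctuation would be handled differently there.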
When I try to take the most frequent vocabulary words and print them, I get this error:
num of reviews 100
number of unique tokens : 4761
Traceback (most recent call last):
  File "preprocess.py", line 47, in <module>
    print(vocab)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 10561: ordinal not in range(128)
The code is as follows:
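import itertools
import os

import nltk
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize

# vocab_size and unknown_token must be defined elsewhere in preprocess.py;
# their values are not shown in this excerpt.
reviews = []
for item in os.listdir('imdbdata/train/pos')[:100]:
    # Read each review as UTF-8, strip the HTML, then lower-case and tokenize it
    with open("imdbdata/train/pos/"+item,'r',encoding='utf-8') as f:
        sample = BeautifulSoup(f.read()).get_text()
    sample = word_tokenize(sample.lower())
    reviews.append(sample)
print("num of reviews", len(reviews))

# Count token frequencies across all reviews
word_freq = nltk.FreqDist(itertools.chain(*reviews))
print("number of unique tokens : %d"%(len(word_freq.items())))

# Keep the most frequent tokens and reserve the last index for unknown words
vocab = word_freq.most_common(vocab_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict((w,i) for i,w in enumerate(index_to_word))
print(vocab)  # this is the call that raises the UnicodeEncodeError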
My question is: when working on NLP problems in Python, how can I get rid of this UnicodeEncodeError, especially when reading text with the open function?
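For context, the traceback points at print(vocab) rather than at open, which suggests the failure happens when the text is written to an ASCII-only output stream, not when the files are read. The sketch below reproduces that situation without touching the real terminal and shows one possible workaround. The `safe_text` helper and the sample data are assumptions made up for illustration, not part of the script above.

```python
import io
import sys


def safe_text(obj, encoding=None):
    """Return repr(obj) with characters the target encoding cannot represent escaped."""
    encoding = encoding or sys.stdout.encoding or "ascii"
    return repr(obj).encode(encoding, errors="backslashreplace").decode(encoding)


# Made-up sample: one token contains '\xe9', like the one in the traceback.
vocab = [("café", 3), ("movie", 2)]

# Simulate a terminal whose encoding is ASCII.
ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    print(vocab, file=ascii_stream)
    ascii_stream.flush()
except UnicodeEncodeError as exc:
    print("print failed:", exc.reason)

# Escaping unencodable characters first lets the same data be printed safely.
print(safe_text(vocab, "ascii"))
```

Other options that may fit better, depending on the environment: calling sys.stdout.reconfigure(encoding="utf-8") on Python 3.7 or later, or setting the PYTHONIOENCODING=utf-8 environment variable before running the script.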
This is the answer I was looking for! Thanks.