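Python NLTK Snowball stemmer: UnicodeDecodeError in the terminal but not in Eclipse PyDev
I am using the Snowball stemmer to stem the words of a document, as shown in the code snippet below.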
stemmer = EnglishStemmer()
# Remove punctuation, tokenize, drop stopwords, then lowercase and stem each token.
attribute_names = [stemmer.stem(token.lower()) for token in wordpunct_tokenize(re.sub('[%s]' % re.escape(string.punctuation), '', doc)) if token.lower() not in stopwords.words('english')]
When I run this on a document in Eclipse with PyDev, I get no errors. When I run it in the terminal (Mac OS X), I get the error message below. Can anyone help?
File "data_processing.py", line 171, in __filter__
attribute_names = [stemmer.stem(token.lower()) for token in wordpunct_tokenize(re.sub('[%s]' % re.escape(string.punctuation), '', doc)) if token.lower() not in stopwords.words('english')]
File "7.3/lib/python2.7/site-packages/nltk-2.0.4-py2.7.egg/nltk/stem/snowball.py", line 694, in stem
word = (word.replace(u"\u2019", u"\x27")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 7: ordinal not in range(128)
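A note on what the traceback suggests: in Python 2, calling `replace` on a byte string (`str`) with `unicode` arguments makes Python implicitly decode the byte string with the default `ascii` codec, which fails on any non-ASCII byte. Byte 0xc2 is the lead byte of a two-byte UTF-8 sequence, so `doc` is probably a UTF-8 encoded byte string rather than a `unicode` object. The different behaviour in PyDev may come from a different default encoding in that environment, but that is an assumption the traceback does not confirm. A minimal sketch of one possible fix is to decode the document before tokenizing and stemming. The helper name `to_text` and the sample bytes are hypothetical, and UTF-8 is assumed as the file encoding:

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def to_text(doc, encoding="utf-8"):
    """Return doc as a text (unicode) string.

    Hypothetical helper: byte strings are decoded with the given
    encoding (UTF-8 is an assumption), text strings pass through.
    """
    if isinstance(doc, bytes):  # in Python 2, bytes is an alias for str
        return doc.decode(encoding)
    return doc


# Sample UTF-8 bytes containing a non-ASCII letter and U+2019 (illustrative only).
raw = b"caf\xc3\xa9 don\xe2\x80\x99t"
text = to_text(raw)
```

With this in place, the original list comprehension would receive `to_text(doc)` instead of `doc`, so every token reaching `stemmer.stem` is already a text string and no implicit ASCII decoding takes place.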