Getting unrecognized words when finding trigrams with NLTK Collocations

I am using NLTK Collocations to find trigrams. 'training_set' is a string containing multiple lines of text.

finder = TrigramCollocationFinder.from_words(str(training_set)) 
print finder.nbest(trigram_measures.pmi, 5)

But the output I get is:

[('\xe5', '\x8d', '\xb8'), ('\xe5', '\x85', '\x8d'), ('\xe2', '\x80', '\x9c'), ('\xe2', '\x80', '\x9d'), ('\xe2', '\x80', '\xa6')]

Is this some kind of encoding problem? How do I get normal English words?

2014-09-05 Shivendra

Yes, those look like 'windows-1252' encoded characters:

>>> import chardet
>>> chardet.detect('\xe5')
{'confidence': 0.5, 'encoding': 'windows-1252'}
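
As a cross-check on that guess, here is a minimal Python 3 sketch (standard library only) that joins each three-byte tuple from the question's output and tries to decode it as UTF-8. The names `reported` and `decode_triples` are illustrative and not part of NLTK or chardet.

```python
# Tuples copied from the question's output, rewritten as Python 3 bytes.
reported = [
    (b"\xe5", b"\x8d", b"\xb8"),
    (b"\xe5", b"\x85", b"\x8d"),
    (b"\xe2", b"\x80", b"\x9c"),
    (b"\xe2", b"\x80", b"\x9d"),
    (b"\xe2", b"\x80", b"\xa6"),
]


def decode_triples(triples, encoding="utf-8"):
    """Join each tuple of single bytes and decode it; None marks a failure."""
    decoded = []
    for triple in triples:
        try:
            decoded.append(b"".join(triple).decode(encoding))
        except UnicodeDecodeError:
            decoded.append(None)
    return decoded


for triple, text in zip(reported, decode_triples(reported)):
    print(triple, "->", repr(text))
```

If every tuple decodes to a single character, that would suggest the text is UTF-8 and that the finder was counting individual bytes rather than words.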

So, if you don't want these to show up, you can do something like this to your text:

>>> '\xe5'.decode('windows-1252').encode('ascii', 'ignore')
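
To see what that expression does, here is a minimal Python 3 sketch of the same decode-then-encode round trip. Python 3 separates `bytes` from `str`, so the input is written as a bytes literal; the helper name `strip_non_ascii` is hypothetical.

```python
def strip_non_ascii(raw, encoding="windows-1252"):
    """Decode raw bytes, then drop every character outside ASCII."""
    text = raw.decode(encoding)
    # errors="ignore" silently discards characters ASCII cannot represent.
    return text.encode("ascii", "ignore").decode("ascii")


print(repr(strip_non_ascii(b"caf\xe9 menu")))  # the accented letter is dropped
print(repr(strip_non_ascii(b"\xe5")))          # no ASCII characters remain
```

Because `"ignore"` discards characters rather than replacing them, an input made up only of non-ASCII characters comes back empty.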

2014-09-09 16:12:13 leavesof3

Running the decode-and-encode snippet gives an empty string. – Shivendra 2014-09-10 06:11:07

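Well, they won't be English words, because they are foreign characters. Just omit the encode part to get the actual letter: >>> print '\xe5'.decode('windows-1252') prints å. It also looks like what you have are not trigrams but single letters. You probably have to tokenize the text before passing it to TrigramCollocationFinder. – leavesof3 2014-09-12 03:14:09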

finder = TrigramCollocationFinder.from_words(nltk.word_tokenize(str(training_set))) – leavesof3 2014-09-12 03:35:37
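
The difference tokenizing makes can be sketched without NLTK. The Python 3 example below uses a hypothetical `trigrams` helper and plain `str.split()` as stand-ins for `TrigramCollocationFinder` and `nltk.word_tokenize`; it only illustrates that iterating over a string yields characters, while iterating over a token list yields words.

```python
def trigrams(items):
    """Return every run of three consecutive items as a tuple."""
    items = list(items)
    return list(zip(items, items[1:], items[2:]))


text = "the quick brown fox"

# A string is a sequence of characters, so the "words" are single letters.
print(trigrams(text)[:2])

# Splitting first gives a list of tokens, so the trigrams are word triples.
print(trigrams(text.split()))
```

In Python 2, where `str` holds bytes, the same effect splits each multi-byte character into separate one-byte "words".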
