Why does PorterStemmer in NLTK convert my "string" into u"string"?

import nltk 
import string 
from nltk.corpus import stopwords 


from collections import Counter 

def get_tokens():
    with open('comet_interest.xml', 'r') as bookmark:
        text = bookmark.read()
        lowers = text.lower()

        no_punctuation = lowers.translate(None, string.punctuation)
        tokens = nltk.word_tokenize(no_punctuation)
        return tokens
#remove stopwords 
tokens=get_tokens() 
filtered = [w for w in tokens if not w in stopwords.words('english')] 
count = Counter(filtered) 
print count.most_common(10) 

#stemming 
from nltk.stem.porter import * 

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

stemmer = PorterStemmer() 
stemmed = stem_tokens(filtered, stemmer) 
count = Counter(stemmed) 
print count.most_common(10)

The results look like this:

[('analysis', 13), ('spatial', 11), ...]

[..., (u'spatial', 11), (u'use', 11), (u'feb', 8), (u'cdata', 8), (u'scienc', 7), (u'descript', 7), (u'item', 6), (u'includ', 6), (u'mani', 6)]

What is wrong with the second result, the one produced after stemming? Why does every word have a leading "u"?

Asked 2015-02-12 by Hao Wu

...because they are Unicode strings? – kindall 2015-02-12 04:20:42

Oh. But why isn't the first one Unicode? And how do I convert from Unicode to a string? – 2015-02-12 16:49:56

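As @kindall pointed out, it is because they are unicode strings.

More specifically, it is because NLTK uses from __future__ import unicode_literals, which makes ALL string literals in the module that imports it unicode by default; see https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py#L87

So let's try an experiment in Python 2.x: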

$ python 
>>> from nltk.stem import PorterStemmer 
>>> porter = PorterStemmer() 
>>> word = "analysis" 
>>> word 
'analysis' 
>>> porter.stem(word) 
u'analysi'

We see that the stemmed word has suddenly become unicode.

Next, let's try importing unicode_literals ourselves:

>>> from nltk.stem import PorterStemmer 
>>> porter = PorterStemmer() 
>>> word = "analysis" 
>>> word 
'analysis' 
>>> porter.stem(word) 
u'analysi' 
>>> from __future__ import print_function, unicode_literals 
>>> word 
'analysis' 
>>> word2 = "analysis" 
>>> word2 
u'analysis'

Note that strings that already exist remain byte strings, but any new string literal created after importing unicode_literals becomes unicode by default.
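On the follow-up about getting rid of the prefix: the u only appears in the repr of a value, which is what printing a whole list shows. The sketch below is a minimal illustration, not the asker's pipeline. The stemmed list is hypothetical sample data written as unicode literals to stand in for PorterStemmer output, and the str() conversion assumes ASCII-only words (on Python 2, non-ASCII text would need something like .encode('utf-8') instead).

```python
from collections import Counter

# Hypothetical sample data standing in for PorterStemmer output on Python 2,
# where stem() returns unicode objects.
stemmed = [u'analysi', u'spatial', u'analysi', u'use', u'spatial', u'analysi']

count = Counter(stemmed)

# Printing the list itself shows repr() of each item (with the u prefix on
# Python 2). Formatting each item individually shows only the text.
lines = ["%s %d" % (word, n) for word, n in count.most_common(2)]
for line in lines:
    print(line)

# On Python 2, ASCII-only unicode and byte strings compare and hash equal,
# so a lookup with a plain literal still finds the unicode key.
lookup = count['analysi']

# Assumption: every word is ASCII-only. On Python 2, str() then yields byte
# strings; on Python 3 the values are already str and this is a no-op.
plain = [str(w) for w in stemmed]
```

For most downstream uses the unicode values can probably be left as they are, since only their printed representation differs.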

Answered 2015-02-12 20:56:47 by alvas

What does this mean for the rest of your application, or anywhere else you use the stemming results? – gorjanz 2017-02-15 10:30:26
