Add stemming support to CountVectorizer (sklearn)

I'm trying to add stemming to my NLP pipeline in sklearn.

from nltk.corpus import stopwords
from nltk.stem.snowball import FrenchStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

stop = stopwords.words('french')
stemmer = FrenchStemmer()


class StemmedCountVectorizer(CountVectorizer):
    def __init__(self, stemmer):
        super(StemmedCountVectorizer, self).__init__()
        self.stemmer = stemmer

    def build_analyzer(self):
        # Wrap the default analyzer so that every token it yields is stemmed.
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: (self.stemmer.stem(w) for w in analyzer(doc))

stem_vectorizer = StemmedCountVectorizer(stemmer)
text_clf = Pipeline([('vect', stem_vectorizer),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SVC(kernel='linear', C=1))])

When I use this pipeline with sklearn's plain CountVectorizer, it works. It also works if I create the features manually, like this:

vectorizer = StemmedCountVectorizer(stemmer)
X_counts = vectorizer.fit_transform(X)
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_counts)

Edit:

If I try this pipeline in my IPython Notebook, the cell shows [*] and nothing happens. When I look at my terminal, it gives this error:

Process PoolWorker-12:
Traceback (most recent call last):
  File "C:\Anaconda2\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "C:\Anaconda2\lib\multiprocessing\process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Anaconda2\lib\multiprocessing\pool.py", line 102, in worker
    task = get()
  File "C:\Anaconda2\lib\site-packages\sklearn\externals\joblib\pool.py", line 360, in get
    return recv()
AttributeError: 'module' object has no attribute 'StemmedCountVectorizer'

Example

Here is the complete example:

from sklearn.pipeline import Pipeline
# GridSearchCV lived in sklearn.grid_search when this was posted;
# current releases expose it from sklearn.model_selection.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from nltk.stem.snowball import FrenchStemmer

stemmer = FrenchStemmer()
analyzer = CountVectorizer().build_analyzer()

def stemming(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

X = ['le chat est beau', 'le ciel est nuageux', 'les gens sont gentils',
     'Paris est magique', 'Marseille est tragique', 'JCVD est fou']
Y = [1, 0, 1, 1, 0, 0]

text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SVC())])
parameters = {'vect__analyzer': ['word', stemming]}

# cv=3 matches the default of the release used when this was posted.
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1, cv=3)
gs_clf.fit(X, Y)

If you remove stemming from the parameters, it works; otherwise it does not.

UPDATE:

The problem seems to lie in the parallelization, because it disappears when I remove n_jobs=-1.

Source

2016-03-23 dooms

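This seems to be a problem with pickling and unpickling scope. If you put 'stemming' in an imported module, for instance, it will be unpickled more reliably. – joeln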
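To illustrate the pickling point in the comment above, here is a minimal sketch using only the standard library. The module name `notebook_session` and the toy `stemming` function are hypothetical stand-ins for the notebook's `__main__` and the real stemmer; deleting the attribute is just a way to simulate a worker process that never defined the function.

```python
import pickle
import sys
import types

# Hypothetical stand-in for the notebook's __main__ module.
session = types.ModuleType("notebook_session")
sys.modules["notebook_session"] = session

# Define a toy analyzer function inside that module (not a real stemmer).
exec(
    "def stemming(doc):\n"
    "    return [w.rstrip('s') for w in doc.split()]\n",
    session.__dict__,
)

# pickle stores a function as a reference (module name + function name),
# not as code, so both names appear in the payload.
payload = pickle.dumps(session.stemming)
print(b"notebook_session" in payload, b"stemming" in payload)

# While the name is still defined, unpickling finds the very same function.
same_function = pickle.loads(payload) is session.stemming

# Simulate a worker whose copy of the module never defined the function.
del session.stemming
try:
    pickle.loads(payload)
    error = None
except AttributeError as exc:
    error = exc

print(type(error).__name__)
```

The lookup fails with an AttributeError on the module, which is the same kind of error as the one in the traceback from the question.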

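Can you provide an example or a link so I can understand what you mean? How do I put 'stemming' in an imported module? Without parallelization the GridSearch is slow, and I have several parameters to tune. – dooms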

For what it's worth, I can run your complete example without any problem. But what I meant is to move the code for 'stemming' into 'myutils.py' and use 'from myutils import stemming'. – joeln
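The suggestion above could be sketched roughly as follows. This is only an illustration under stated assumptions: the file is written to a temporary directory so the snippet is self-contained, and the suffix-stripping function is a toy stand-in for the FrenchStemmer-based `stemming` from the question, so no third-party package is needed.

```python
import os
import pickle
import sys
import tempfile

# Hypothetical contents of myutils.py. The toy "stemmer" only strips a
# trailing "s"; the real file would hold the FrenchStemmer-based function.
MYUTILS_SOURCE = '''
def stemming(doc):
    """Split on whitespace and strip a trailing "s" from each token."""
    return [w[:-1] if w.endswith("s") else w for w in doc.lower().split()]
'''

module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "myutils.py"), "w") as handle:
    handle.write(MYUTILS_SOURCE)
sys.path.insert(0, module_dir)  # make the module importable, as in a project folder

from myutils import stemming

# Round-trip through pickle, roughly what happens when work is sent to a
# worker process: the function is stored as a reference to myutils.stemming.
restored = pickle.loads(pickle.dumps(stemming))
print(restored.__module__)
print(restored("les gens sont gentils"))
```

Because the function now belongs to an importable module, any process that can import `myutils` can resolve the reference. In the grid search, `parameters = {'vect__analyzer': ['word', stemming]}` would then refer to the imported function.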

You can try:

def build_analyzer(self):
    # Uses the module-level `stemmer` instead of an instance attribute.
    analyzer = super(StemmedCountVectorizer, self).build_analyzer()
    return lambda doc: (stemmer.stem(w) for w in analyzer(doc))

and remove the __init__ method.

Source

2016-03-23 16:17:21 Till

It doesn't work (it gives the same error), and I need the stemmer attribute. – dooms

Can you give more information about the printed error? For example, which line fails? – Till

I'm using a GridSearch with n_jobs=-1 to parallelize the work. – dooms

You can pass a callable as analyzer to the CountVectorizer constructor to provide a custom analyzer. This appears to work for me.

from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.snowball import FrenchStemmer

stemmer = FrenchStemmer()
analyzer = CountVectorizer().build_analyzer()

def stemmed_words(doc):
    # Stem every token produced by the default word analyzer.
    return (stemmer.stem(w) for w in analyzer(doc))

stem_vectorizer = CountVectorizer(analyzer=stemmed_words)
print(stem_vectorizer.fit_transform(['Tu marches dans la rue']))
# In scikit-learn 1.0 and later this method is get_feature_names_out().
print(stem_vectorizer.get_feature_names())

This prints:

  (0, 4)    1
  (0, 2)    1
  (0, 0)    1
  (0, 1)    1
  (0, 3)    1
[u'dan', u'la', u'march', u'ru', u'tu']

Source

2016-03-24 00:46:47 joeln

parameters = {'vect__analyzer': ['word', stemming]} Using this parameter in the grid search gives the error: AttributeError: 'module' object has no attribute 'stemming' – dooms

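I know I'm a little late in posting my answer, but here it is in case someone still needs help.

The following is the cleanest approach: add language stemming to the count vectorizer by overriding build_analyzer().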

from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
import nltk.stem

french_stemmer = nltk.stem.SnowballStemmer('french')

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: [french_stemmer.stem(w) for w in analyzer(doc)]

# sklearn only ships a built-in 'english' stop list, so pass the NLTK French list.
vectorizer_s = StemmedCountVectorizer(min_df=3, analyzer="word",
                                      stop_words=stopwords.words('french'))

You can freely call the fit and transform functions of the CountVectorizer class on your vectorizer_s object.

Source

2016-12-29 10:11:04
