How to prevent TfidfVectorizer from picking up numbers as vocabulary

I use TfidfVectorizer like this:

from nltk.corpus import stopwords  # needs the NLTK "stopwords" corpus to be downloaded
from sklearn.feature_extraction.text import TfidfVectorizer

stop_words = stopwords.words("english")
vectorizer = TfidfVectorizer(stop_words=stop_words, min_df=200)
xs['train'] = vectorizer.fit_transform(docs['train'])
xs['test'] = vectorizer.transform(docs['test']).toarray()

But when I inspect vectorizer.vocabulary_, I find that it has learned purely numeric features:

[(u'00', 0), (u'000', 1), (u'0000', 2), (u'00000', 3), (u'000000', 4)

I don't want this. How can I prevent it?

Source

2017-08-07 Martin Thoma

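You can define token_pattern when you initialize the vectorizer. The default is r'(?u)\b\w\w+\b' (the (?u) part only turns on the re.UNICODE flag). Write the pattern as a raw string, because in an ordinary Python string literal \b is a backspace character rather than a word boundary. You can tinker with the pattern until you get what you need.

Something like: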

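# Raw string so that \b reaches the regex engine as a word boundary.
# A token must contain at least one ASCII letter, so purely numeric tokens are dropped.
vectorizer = TfidfVectorizer(stop_words=stop_words,
                             min_df=200,
                             token_pattern=r'(?u)\b\w*[a-zA-Z]\w*\b')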
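To see what such a pattern keeps and drops without fitting a vectorizer, the two regular expressions can be compared directly with Python's re module. This is only a rough sketch: the sample sentence is made up, and it assumes that applying re.findall with the pattern approximates how the vectorizer tokenizes text (it skips lowercasing and stop-word removal).

```python
import re

# Default token pattern documented for TfidfVectorizer.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Variant from the answer: a token must contain at least one ASCII letter.
LETTER_PATTERN = r"(?u)\b\w*[a-zA-Z]\w*\b"

# Made-up sample text, used only for illustration.
sample = "In 2017 the 3d model cost 1000 dollars, order 00042"

default_tokens = re.findall(DEFAULT_PATTERN, sample)
letter_tokens = re.findall(LETTER_PATTERN, sample)

print(default_tokens)  # includes '2017', '1000' and '00042'
print(letter_tokens)   # purely numeric tokens are gone, '3d' is kept
```

Two side effects are worth checking on your own data: mixed tokens such as 3d survive, and unlike the default, this pattern also accepts single-letter tokens, and it only recognises ASCII letters.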

Another option (if the fact that numbers appear in your samples matters) is to mask all the numbers before vectorizing.

re.sub(r'\b[0-9][0-9.,-]*\b', 'NUMBER-SPECIAL-TOKEN', sample)

That way all numbers land on the same entry in your vectorizer's vocabulary, and you are not ignoring them completely either.
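A minimal sketch of the masking step, using only the standard library. The helper name mask_numbers and the sample sentence are made up for illustration. The placeholder is deliberately written without hyphens, which is an assumption on my part: with the default token pattern, a hyphenated placeholder would be split into several separate tokens instead of staying a single vocabulary entry.

```python
import re

# A digit followed by any run of digits, dots, commas or hyphens.
NUMBER_RE = re.compile(r"\b[0-9][0-9.,-]*\b")

# Hypothetical placeholder; hyphen-free so it remains one token.
PLACEHOLDER = "NUMBERSPECIALTOKEN"


def mask_numbers(text):
    """Replace every number-like span in text with the placeholder."""
    return NUMBER_RE.sub(PLACEHOLDER, text)


# Made-up sample text, used only for illustration.
masked = mask_numbers("Paid 1,000.50 on 2017-08-07 for 3 items")
print(masked)

# Tokenizing with the default pattern shows why the hyphens matter.
default_pattern = r"(?u)\b\w\w+\b"
print(re.findall(default_pattern, "NUMBER-SPECIAL-TOKEN"))
print(re.findall(default_pattern, PLACEHOLDER))
```

The masked documents could then be passed to fit_transform and transform in place of the raw ones, for example via a list comprehension over docs['train'].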

Source

2017-08-07 13:16:56
