How can I prevent TfidfVectorizer from using numbers as vocabulary?
I use TfidfVectorizer like this:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# docs and xs are dicts defined elsewhere (not shown in this post)
stop_words = stopwords.words("english")
vectorizer = TfidfVectorizer(stop_words=stop_words, min_df=200)
xs['train'] = vectorizer.fit_transform(docs['train'])
xs['test'] = vectorizer.transform(docs['test']).toarray()
But when I inspect vectorizer.vocabulary_, I find that it has learned purely numeric features:
[(u'00', 0), (u'000', 1), (u'0000', 2), (u'00000', 3), (u'000000', 4)
I don't want this. How can I prevent it?
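One possible direction, offered as a sketch and not as a confirmed fix: TfidfVectorizer accepts a token_pattern regular expression, and its default (r"(?u)\b\w\w+\b") matches digit-only strings such as "00". The pattern below is an assumption of mine: it adds a negative lookahead that rejects tokens made up entirely of digits while keeping mixed tokens like "3d". The names NO_PURE_NUMBERS_PATTERN, tokenize and make_vectorizer are hypothetical helpers for illustration only.

```python
import re

# scikit-learn's default token_pattern; it accepts digit-only tokens such as "00".
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Assumed replacement: same as the default, but the negative lookahead
# (?!\d+\b) rejects tokens that consist only of digits.
NO_PURE_NUMBERS_PATTERN = r"(?u)\b(?!\d+\b)\w\w+\b"


def tokenize(text, pattern):
    """Mimic the vectorizer's default steps: lowercase, then regex findall."""
    return re.findall(pattern, text.lower())


def make_vectorizer(**kwargs):
    """Build a TfidfVectorizer that skips purely numeric tokens."""
    # Imported lazily so the regex demo below runs without scikit-learn.
    from sklearn.feature_extraction.text import TfidfVectorizer
    return TfidfVectorizer(token_pattern=NO_PURE_NUMBERS_PATTERN, **kwargs)


sample = "Paid 000 dollars for 00 items in 3D, code abc123"
print(tokenize(sample, DEFAULT_TOKEN_PATTERN))    # includes '000' and '00'
print(tokenize(sample, NO_PURE_NUMBERS_PATTERN))  # digit-only tokens dropped
```

With this helper, the vectorizer from the question could be built as make_vectorizer(stop_words=stop_words, min_df=200). If tokens that merely contain digits (for example "abc123") should also be excluded, a stricter pattern would be needed.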