Empty vocabulary with CountVectorizer

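I am trying to convert a string into a numeric vector, and when the tokens are single letters CountVectorizer reports an empty vocabulary.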

import re
from sklearn.feature_extraction.text import CountVectorizer

### Clean the string
def names_to_words(names):
    print('a')
    # Replace every non-letter with a space, lowercase, then split on whitespace
    words = re.sub("[^a-zA-Z]", " ", names).lower().split()
    print('b')

    return words


### Vectorization
def Vectorizer():
    # Build a word-level CountVectorizer capped at 5000 features
    vectorizer = CountVectorizer(
        analyzer="word",
        tokenizer=None,
        preprocessor=None,
        stop_words=None,
        max_features=5000)
    return vectorizer


### Test a string
s = 'abc...'
r = names_to_words(s)
feature = Vectorizer().fit_transform(r).toarray()

However, when the list of words looks like this:

['g', 'o', 'm', 'd']

I get this error:

ValueError: empty vocabulary; perhaps the documents only contain stop words

There seems to be a problem with single-letter strings like these. What should I do? Thanks.

Source

2017-04-25 user815408

So, what do you want to do? Include these single-letter words in your vocabulary?

The default token_pattern regular expression in CountVectorizer only selects tokens that have at least 2 characters, as stated in the documentation:

token_pattern : string

Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

In the source code of CountVectorizer, the default is r"(?u)\b\w\w+\b".

Change it to r"(?u)\b\w+\b" to include one-letter words.
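To see the difference between the two patterns without involving scikit-learn at all, here is a minimal sketch that applies each one with Python's `re.findall`, which is roughly how a regex-based word tokenizer would use them. The names `DEFAULT_PATTERN`, `RELAXED_PATTERN`, `tokens`, and the extra `"abc"` sample are made up for illustration.

```python
import re

# Pattern quoted from the CountVectorizer source: two or more word characters
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Suggested replacement: one or more word characters
RELAXED_PATTERN = r"(?u)\b\w+\b"


def tokens(pattern, text):
    """Return every substring of text that matches pattern."""
    return re.findall(pattern, text)


# The four single letters from the question, plus one longer word for contrast
docs = ["g", "o", "m", "d", "abc"]

default_tokens = [tokens(DEFAULT_PATTERN, d) for d in docs]
relaxed_tokens = [tokens(RELAXED_PATTERN, d) for d in docs]

print(default_tokens)  # single letters produce no tokens
print(relaxed_tokens)  # every input produces one token
```

With the default pattern every single-letter input yields an empty token list, so a corpus made only of such inputs leaves nothing to build a vocabulary from.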

Change your code to the following (adding the token_pattern parameter suggested above):

Vectorizer= CountVectorizer(
       analyzer = "word", 
       tokenizer = None, 
       preprocessor = None, 
       stop_words = None, 
       max_features = 5000, 
       token_pattern = r"(?u)\b\w+\b")
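As a quick check, the following sketch fits a vectorizer configured this way on the list from the question. It assumes scikit-learn is installed; the variable names are illustrative only. Note that `fit_transform` treats each list element as a separate document, so four one-letter strings give a 4 x 4 matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The list that triggered the ValueError in the question
words = ["g", "o", "m", "d"]

# Same idea as the snippet above: accept tokens of one or more characters
vectorizer = CountVectorizer(
    analyzer="word",
    max_features=5000,
    token_pattern=r"(?u)\b\w+\b")

# Each list element is one document; each row counts that document's tokens
features = vectorizer.fit_transform(words).toarray()

# vocabulary_ maps each token to its column index; sorting gives column order
vocabulary = sorted(vectorizer.vocabulary_)

print(vocabulary)
print(features)
```

If the goal is one vector for the whole name rather than one row per letter, joining the words into a single string before calling `fit_transform` (for example `[" ".join(words)]`) may be closer to what is wanted.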

Source

2017-04-25 08:23:34
