
Converting a string to a numeric vector of single letters: CountVectorizer reports an empty vocabulary

import re
from sklearn.feature_extraction.text import CountVectorizer

### Clean the string 
def names_to_words(names): 
    print('a') 
    words = re.sub("[^a-zA-Z]"," ",names).lower().split() 
    print('b') 

    return words 


### Vectorization 
def Vectorizer(): 
    Vectorizer= CountVectorizer(
       analyzer = "word", 
       tokenizer = None, 
       preprocessor = None, 
       stop_words = None, 
       max_features = 5000) 
    return Vectorizer 


### Test a string 
s = 'abc...' 
r = names_to_words(s) 
feature = Vectorizer().fit_transform(r).toarray() 

However, when the cleaned input is a list of single letters such as:

['g', 'o', 'm', 'd'] 

I get this error:

ValueError: empty vocabulary; perhaps the documents only contain stop words 

It seems there is a problem with single-letter strings like these. What should I do? Thanks.


So what are you trying to do? Include these single-letter words in your vocabulary?

Answer


The default token_pattern regular expression in CountVectorizer only selects tokens with at least 2 characters, as stated in the documentation:

token_pattern : string

Regular expression denoting what constitutes a "token", only used if analyzer == 'word'. The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

In the source code of CountVectorizer, the default pattern is r"(?u)\b\w\w+\b".

Change it to r"(?u)\b\w+\b" to include 1-letter words.
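To see why the default drops single letters, the two patterns can be compared with plain `re`, outside scikit-learn. This is only a rough sketch: the names `DEFAULT_PATTERN`, `SINGLE_CHAR_PATTERN`, and `letters` are made up for illustration, and the real vectorizer also applies lowercasing and other steps before tokenizing.

```python
import re

# Patterns quoted in the answer above; the constant names are illustrative.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"      # requires 2 or more word characters
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"    # accepts 1 or more word characters

# The single-letter documents from the question.
letters = ["g", "o", "m", "d"]

# Tokenize every document with each pattern and collect the matches.
default_tokens = [tok for doc in letters for tok in re.findall(DEFAULT_PATTERN, doc)]
relaxed_tokens = [tok for doc in letters for tok in re.findall(SINGLE_CHAR_PATTERN, doc)]

print(default_tokens)  # []
print(relaxed_tokens)  # ['g', 'o', 'm', 'd']
```

With the default pattern no token survives, which is what leads to the "empty vocabulary" error.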

Change your code to the following (adding the token_pattern parameter suggested above):

Vectorizer= CountVectorizer(
       analyzer = "word", 
       tokenizer = None, 
       preprocessor = None, 
       stop_words = None, 
       max_features = 5000, 
       token_pattern = r"(?u)\b\w+\b")
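As a quick check, here is a minimal sketch that runs the single-letter input from the question through both configurations. It assumes scikit-learn is installed; the names `docs`, `vectorizer`, `vocab`, and `features` are illustrative and not part of the original code.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The single-letter documents from the question.
docs = ["g", "o", "m", "d"]

# With the default token_pattern, every document yields zero tokens,
# so fitting raises the "empty vocabulary" ValueError.
try:
    CountVectorizer().fit_transform(docs)
    default_failed = False
except ValueError as err:
    default_failed = True
    print("default pattern:", err)

# With the relaxed pattern, single letters are kept as tokens.
vectorizer = CountVectorizer(
    analyzer="word",
    max_features=5000,
    token_pattern=r"(?u)\b\w+\b",
)
features = vectorizer.fit_transform(docs).toarray()

# vocabulary_ maps each token to its column index (assigned in sorted order).
vocab = sorted(vectorizer.vocabulary_)
print(vocab)     # ['d', 'g', 'm', 'o']
print(features)  # one row per document, one column per letter
```

Each row of `features` is a one-hot count vector for the corresponding letter, with columns in alphabetical order of the vocabulary.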