Getting CountVectorizer to include "1:1"

Suppose I have some text that contains the phrase "1:1". How do I get CountVectorizer to recognize it as a token?

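from sklearn.feature_extraction.text import CountVectorizer

text = ["first ques # 1:1 on stackoverflow", "please help"]
vec = CountVectorizer()
vec.fit_transform(text)

# get_feature_names_out() in scikit-learn 1.0 and later
vec.get_feature_names()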

2017-03-18 Huey

You can use a custom tokenizer. For simple cases, using

vec = CountVectorizer(tokenizer=lambda s: s.split())

in place of

vec = CountVectorizer()

will do. With this change your code returns:

[u'#', u'1:1', u'first', u'help', u'on', u'please', u'ques', u'stackoverflow']

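Hopefully this suggestion puts you on the right track, but note that this workaround will not work properly in more complex cases (for example, if your text contains punctuation marks).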
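To make the punctuation caveat concrete, here is a minimal sketch that applies the same `s.split()` logic directly to plain strings, without scikit-learn. The sample sentences are only illustrative, and the explicit `lower()` call is there to mimic CountVectorizer's default lowercasing.

```python
# Illustrative documents containing punctuation.
text = ["first ques... # 1:1, on stackoverflow", "please, help!"]

# str.split() with no arguments splits on runs of whitespace only,
# so punctuation stays attached to the neighbouring word.
tokens = [doc.lower().split() for doc in text]

for doc_tokens in tokens:
    print(doc_tokens)
# ['first', 'ques...', '#', '1:1,', 'on', 'stackoverflow']
# ['please,', 'help!']
```

Here `'1:1,'` and `'ques...'` keep their trailing punctuation, so they would end up as vocabulary entries distinct from `'1:1'` and `'ques'`.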

To deal with punctuation marks, you can pass CountVectorizer a token pattern like this:

text = [u"first ques... # 1:1, on stackoverflow", u"please, help!"]
# Raw string, so the regex backslashes are not treated as string escapes
vec = CountVectorizer(token_pattern=r'\w:?\w+')

Output:

[u'1:1', u'first', u'help', u'on', u'please', u'ques', u'stackoverflow']
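The effect of the pattern can be checked with the standard `re` module alone. The sketch below assumes that CountVectorizer tokenizes by lowercasing each document and collecting all matches of `token_pattern`, and that its documented default pattern is `(?u)\b\w\w+\b`. The helper `tokenize`, the sentence "meet at 12:30", and `ALT_PATTERN` are hypothetical and introduced here only for illustration.

```python
import re

# Documented scikit-learn default: tokens of two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Pattern from this answer: one word character, an optional colon,
# then one or more word characters.
ANSWER_PATTERN = r"\w:?\w+"
# Hypothetical alternative: "word:word" first, else a 2+ character word.
ALT_PATTERN = r"\w+:\w+|\w\w+"


def tokenize(pattern, doc):
    """Lowercase doc and return all regex matches (assumed vectorizer behaviour)."""
    return re.compile(pattern).findall(doc.lower())


doc = "first ques... # 1:1, on stackoverflow"
default_tokens = tokenize(DEFAULT_PATTERN, doc)
answer_tokens = tokenize(ANSWER_PATTERN, doc)

print(default_tokens)  # ['first', 'ques', 'on', 'stackoverflow']
print(answer_tokens)   # ['first', 'ques', '1:1', 'on', 'stackoverflow']

# Only a single character is allowed before the colon, so longer
# left-hand sides are split apart by the answer's pattern.
print(tokenize(ANSWER_PATTERN, "meet at 12:30"))  # ['meet', 'at', '12', '30']
print(tokenize(ALT_PATTERN, "meet at 12:30"))     # ['meet', 'at', '12:30']
```

With the default pattern, the colon splits "1:1" into two single-character pieces, which are too short to count as tokens. If values such as "12:30" should also survive as one token, a pattern along the lines of `ALT_PATTERN` may be a better fit, though that depends on the data.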

2017-03-19 10:29:53 Tonechas

Yes, what would my options be if my text has punctuation? – Huey
