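Can I use CountVectorizer in scikit-learn to count the frequency of documents that were not used to extract the tokens?

I have been working with the CountVectorizer class in scikit-learn.

I understand that, if used in the manner shown below, the final output will be an array containing counts of features, or tokens.

These tokens are extracted from a set of keywords, i.e.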

tags = [ 
    "python, tools", 
    "linux, tools, ubuntu", 
    "distributed systems, linux, networking, tools", 
]

The next step is:

from sklearn.feature_extraction.text import CountVectorizer
# tokenize is a custom tokenizer function that is not shown here
vec = CountVectorizer(tokenizer=tokenize)
data = vec.fit_transform(tags).toarray()
print(data)

We get

[[0 0 0 1 1 0] 
[0 1 0 0 1 1] 
[1 1 1 0 1 0]]

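This is fine, but my situation is just a little different.

I want to extract the features the same way as above, but I don't want the rows in data to be the same documents that the features were extracted from.

In other words, how can I get counts for another set of documents, say,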

list_of_new_documents = [ 
    ["python, chicken"], 
    ["linux, cow, ubuntu"], 
    ["machine learning, bird, fish, pig"] 
]

and get:

[[0 0 0 1 0 0] 
[0 1 0 0 0 1] 
[0 0 0 0 0 0]]

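I have read the documentation for the CountVectorizer class and came across the vocabulary argument, which is a mapping of terms to feature indices. However, I can't seem to get this argument to help me.

Any advice is appreciated.
PS: all credit goes to Matthias Friedrich's Blog for the example I used above.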

2014-04-07 Matt O' Brien

You're right that vocabulary is what you want. It works like this:

>>> import sklearn.feature_extraction.text
>>> cv = sklearn.feature_extraction.text.CountVectorizer(vocabulary=['hot', 'cold', 'old'])
>>> cv.fit_transform(['pease porridge hot', 'pease porridge cold', 'pease porridge in the pot', 'nine days old']).toarray()
array([[1, 0, 0], 
     [0, 1, 0], 
     [0, 0, 0], 
     [0, 0, 1]], dtype=int64)

So you pass it the features you want, either as a list (as above) or as a dict with your desired features as the keys.
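As a quick check of the list-versus-dict point, here is a minimal sketch that builds one vectorizer from a list and one from an explicit term-to-index dict. The names `cv_list` and `cv_dict` are made up for illustration. With a fixed vocabulary, `transform` can be called directly without fitting first.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "pease porridge hot",
    "pease porridge cold",
    "pease porridge in the pot",
    "nine days old",
]

# Vocabulary as a list: column indices follow the list order.
cv_list = CountVectorizer(vocabulary=["hot", "cold", "old"])

# Vocabulary as a dict: column indices are given explicitly.
cv_dict = CountVectorizer(vocabulary={"hot": 0, "cold": 1, "old": 2})

counts_list = cv_list.transform(docs).toarray()
counts_dict = cv_dict.transform(docs).toarray()

print(counts_list)
print((counts_list == counts_dict).all())
```

Both forms should give the same count matrix here, because the dict assigns the same indices that the list order implies.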

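If you used CountVectorizer on one set of documents and then want to use the set of features from those documents for a new set, take the vocabulary_ attribute of your original CountVectorizer and pass it to the new one. So in your example, you could do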

newVec = CountVectorizer(vocabulary=vec.vocabulary_)

to create a new vectorizer that uses the vocabulary from the first one.
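Put together with the data from the question, that might look like the sketch below. The `tokenize` function here is a hypothetical stand-in that splits on commas, since the question does not show its tokenizer. `token_pattern=None` is passed only because the custom tokenizer replaces the default pattern. The new documents are written as plain strings rather than one-element lists.

```python
from sklearn.feature_extraction.text import CountVectorizer


def tokenize(text):
    # Hypothetical tokenizer (not from the question): split on commas
    # and strip the surrounding whitespace from each tag.
    return [token.strip() for token in text.split(",") if token.strip()]


tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]

new_documents = [
    "python, chicken",
    "linux, cow, ubuntu",
    "machine learning, bird, fish, pig",
]

# Learn the vocabulary from the original documents.
vec = CountVectorizer(tokenizer=tokenize, token_pattern=None)
vec.fit(tags)

# Reuse the learned vocabulary in a second vectorizer.
new_vec = CountVectorizer(
    tokenizer=tokenize, token_pattern=None, vocabulary=vec.vocabulary_
)
new_counts = new_vec.transform(new_documents).toarray()
print(new_counts)
```

Terms that are not in the original vocabulary (chicken, cow, and so on) are simply ignored, so the last document becomes a row of zeros.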

2014-04-07 19:10:04 BrenBarn

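Thanks, this looks great! For the first solution: should the vocabulary always be a dict rather than a list? Correct me if I'm wrong, but the counts (0, 1, 2) seem irrelevant. The second approach you listed looks like it may be a bit clearer.

@MattO'Brien: You're right, it can be a list; I misread the documentation. I edited my answer. In the second approach, however, it is a dict, because that is what the `vocabulary_` attribute of a fitted vectorizer is. – BrenBarn

BrenBarn, your answer saved me a lot of time. Seriously. Thank you for being on this site.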

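You should call fit_transform or fit on your original vocabulary source so that the vectorizer learns a vocabulary.

Then you can use this fitted vectorizer on any new data source via the transform() method.

You can get the vocabulary produced by the fit (i.e. the mapping of words to token IDs) via vectorizer.vocabulary_ (assuming you named your CountVectorizer vectorizer).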
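A minimal sketch of that fit-then-transform pattern, reusing the tag strings from the question with the default tokenizer. The variable names `original_docs` and `new_docs` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

original_docs = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]
new_docs = ["python, chicken", "linux, cow, ubuntu"]

vectorizer = CountVectorizer()

# fit() only learns the vocabulary; it does not return counts.
vectorizer.fit(original_docs)
vocabulary = vectorizer.vocabulary_

# transform() counts the learned terms in documents the vectorizer has not seen.
new_counts = vectorizer.transform(new_docs)

# List the terms in column order, then the shape of the sparse result.
print(sorted(vocabulary, key=vocabulary.get))
print(new_counts.shape)
```

The result of transform() is a sparse matrix with one row per new document and one column per learned term; call .toarray() on it to get a dense array.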

2016-09-05 14:15:10

>>> tags = [ 
    "python, tools", 
    "linux, tools, ubuntu", 
    "distributed systems, linux, networking, tools", 
] 

>>> list_of_new_documents = [ 
    ["python, chicken"], 
    ["linux, cow, ubuntu"], 
    ["machine learning, bird, fish, pig"] 
] 

>>> from sklearn.feature_extraction.text import CountVectorizer 
>>> vect = CountVectorizer() 
>>> tags = vect.fit_transform(tags) 

# vocabulary learned by CountVectorizer (vect) 
>>> print(vect.vocabulary_) 
{'python': 3, 'tools': 5, 'linux': 1, 'ubuntu': 6, 'distributed': 0, 'systems': 4, 'networking': 2} 

# counts for tags 
>>> tags.toarray() 
array([[0, 0, 0, 1, 0, 1, 0], 
     [0, 1, 0, 0, 0, 1, 1], 
     [1, 1, 1, 0, 1, 1, 0]], dtype=int64) 

# to use `transform`, `list_of_new_documents` should be a list of strings 
# `itertools.chain` flattens shallow lists more efficiently than list comprehensions 

>>> from itertools import chain 
>>> new_docs = list(chain.from_iterable(list_of_new_documents))
>>> new_docs = vect.transform(new_docs) 

# finally, counts for new_docs! 
>>> new_docs.toarray() 
array([[0, 0, 0, 1, 0, 0, 0], 
     [0, 1, 0, 0, 0, 0, 1], 
     [0, 0, 0, 0, 0, 0, 0]])

To verify that CountVectorizer used the vocabulary learned from tags on new_docs: print vect.vocabulary_ again, or compare the output of new_docs.toarray() with that of tags.toarray().
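That check can also be done in code. The following is only a sketch of one way to do it; names such as `vocabulary_before` and `same_width` are made up for illustration, and the new documents are written directly as a flat list of strings.

```python
from sklearn.feature_extraction.text import CountVectorizer

tags = [
    "python, tools",
    "linux, tools, ubuntu",
    "distributed systems, linux, networking, tools",
]
new_docs = [
    "python, chicken",
    "linux, cow, ubuntu",
    "machine learning, bird, fish, pig",
]

vect = CountVectorizer()
tag_counts = vect.fit_transform(tags)

# Copy the vocabulary before transforming, so it can be compared afterwards.
vocabulary_before = dict(vect.vocabulary_)
new_counts = vect.transform(new_docs)

# transform() should leave the learned vocabulary untouched ...
vocabulary_unchanged = vect.vocabulary_ == vocabulary_before
# ... and produce the same number of columns as the original matrix.
same_width = new_counts.shape[1] == tag_counts.shape[1]

print(vocabulary_unchanged, same_width)
```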

2017-10-24 15:34:06 user2476665
