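Python equivalent of R's removeSparseTerms

We are working on a data mining project and have used the removeSparseTerms function from the tm package in R to reduce the features of our document-term matrix.

However, we are looking to port the code to Python. Is there a function in sklearn, nltk, or some other package that provides the same functionality?

Thanks!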

2015-06-29 AnirudhJ

If your data is plain text, you can use CountVectorizer to get this job done.

For example:

from sklearn.feature_extraction.text import CountVectorizer

# min_df=2 drops every term that appears in fewer than 2 documents
vectorizer = CountVectorizer(min_df=2)
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = vectorizer.fit(corpus)
print(vectorizer.vocabulary_)
# prints a mapping such as {'this': 4, 'is': 2, 'the': 3, 'document': 0, 'first': 1}
# (key order may vary)
X = vectorizer.transform(corpus)

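Now X is the document-term matrix. (If you are into information retrieval, you may also want to consider Tf–idf term weighting.)

CountVectorizer lets you build the document-term matrix in just a few lines.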
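As a minimal sketch of the Tf–idf variant mentioned above, assuming scikit-learn's TfidfVectorizer is acceptable for your use case: it takes the same min_df argument as CountVectorizer, so the pruning of rare terms works the same way. The variable names here are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Same pruning rule as before: ignore terms found in fewer than 2 documents
tfidf = TfidfVectorizer(min_df=2)
X_tfidf = tfidf.fit_transform(corpus)  # sparse matrix of Tf-idf weights

print(X_tfidf.shape)              # (4, 5): 4 documents, 5 surviving terms
print(sorted(tfidf.vocabulary_))  # ['document', 'first', 'is', 'the', 'this']
```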

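Regarding sparsity, you can control these parameters:

min_df - the minimum document frequency a term needs in order to be kept in the document-term matrix
max_features - the maximum number of features allowed in the document-term matrix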
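A short sketch of both parameters on a small corpus may help. It relies on two documented behaviours of CountVectorizer: a float min_df is read as a proportion of documents, and max_features keeps the terms with the highest total counts. The variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# A float min_df is a proportion: keep terms found in at least half of the documents
by_fraction = CountVectorizer(min_df=0.5).fit(corpus)
print(sorted(by_fraction.vocabulary_))
# ['document', 'first', 'is', 'the', 'this']

# max_features keeps only the terms with the highest total counts in the corpus
top_terms = CountVectorizer(max_features=4).fit(corpus)
print(sorted(top_terms.vocabulary_))
# ['document', 'is', 'the', 'this']
```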

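Alternatively, if you already have the document-term matrix or a Tf-idf matrix, and you have a notion of what counts as sparse, define MIN_VAL_ALLOWED and then do the following: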

import numpy as np
from scipy.sparse import csr_matrix

MIN_VAL_ALLOWED = 2

X = csr_matrix([[7, 8, 0],
                [2, 1, 1],
                [5, 5, 0]])

# z is a boolean mask of the non-sparse terms (column sum above the threshold)
z = np.squeeze(np.asarray(X.sum(axis=0) > MIN_VAL_ALLOWED))

print(X[:, z].toarray())
# prints X without the third term (as it is sparse):
# [[7 8]
#  [2 1]
#  [5 5]]

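(Use X = X[:,z] so that X remains a csr_matrix.)

If it is the minimum document frequency you want to set a threshold on, binarize the matrix first, and then use it in the same way: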

import numpy as np
from scipy.sparse import csr_matrix

MIN_DF_ALLOWED = 2

X = csr_matrix([[7, 1.3, 0.9, 0],
                [2, 1.2, 0.8, 1],
                [5, 1.5, 0, 0]])

# Create a copy of the data and binarize it (every positive entry becomes 1)
B = csr_matrix(X, copy=True)
B[B > 0] = 1

# Column sums of B are document frequencies; keep terms above the threshold
z = np.squeeze(np.asarray(B.sum(axis=0) > MIN_DF_ALLOWED))
print(X[:, z].toarray())
# prints:
# [[7.  1.3]
#  [2.  1.2]
#  [5.  1.5]]

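In this example, the third and fourth terms (columns) are gone, because they appear in no more than two documents (rows). Use MIN_DF_ALLOWED to set the threshold.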
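To get closer to how removeSparseTerms is parameterised, the same document-frequency idea can be expressed with a sparsity proportion instead of an absolute count. The sketch below assumes that "sparsity" means the share of documents in which a term does not occur, and that a term is kept when it occurs in more than (1 - sparse) of the documents; the exact boundary behaviour of the tm function may differ. The helper remove_sparse_terms is a hypothetical name, not part of scikit-learn or SciPy.

```python
import numpy as np
from scipy.sparse import csr_matrix


def remove_sparse_terms(X, sparse):
    """Hypothetical helper: drop columns (terms) that are too sparse.

    A term is kept when it occurs in more than (1 - sparse) of the rows
    (documents). This boundary rule is an assumption.
    """
    if not 0 < sparse < 1:
        raise ValueError("sparse must be strictly between 0 and 1")
    X = csr_matrix(X)
    n_docs = X.shape[0]
    # Document frequency: number of rows with a positive entry in each column
    doc_freq = np.asarray((X > 0).sum(axis=0)).ravel()
    keep = doc_freq > n_docs * (1 - sparse)
    return X[:, keep], keep


X = csr_matrix([[7, 1.3, 0.9, 0],
                [2, 1.2, 0.8, 1],
                [5, 1.5, 0, 0]])

X_reduced, kept = remove_sparse_terms(X, sparse=0.5)
print(kept)             # [ True  True  True False]
print(X_reduced.shape)  # (3, 3)
```

With sparse=0.5 only the fourth term is dropped, since it occurs in one of three documents; a lower value of sparse prunes more aggressively.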


2015-06-29 07:34:48 omerbp
