2017-06-06

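vectorizer.fit_transform raises NotImplementedError: adding a nonzero scalar to a sparse matrix is not supported

I am trying to build a document-term matrix with a custom analyzer in order to extract features from documents. Here is the code: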

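import re
from sklearn.feature_extraction.text import CountVectorizer

# Default analyzer that yields unigrams and bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2))
analyzer = vectorizer.build_analyzer()

def customAnalyzer(text):
    # Drop n-grams that consist only of digits and whitespace
    grams = analyzer(text)
    tgrams = [gram for gram in grams if not re.match(r"^[0-9\s]+$", gram)]
    return tgrams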

This function is then passed to CountVectorizer as a custom analyzer for feature extraction.

for i in xrange(0, num_rows):
    clean_query.append(review_to_words(inp["keyword"][i], units))

# Note: when analyzer is a callable, CountVectorizer ignores
# tokenizer, ngram_range, preprocessor and stop_words.
vectorizer = CountVectorizer(analyzer=customAnalyzer,
                             tokenizer=None,
                             ngram_range=(1, 2),
                             preprocessor=None,
                             stop_words=None,
                             max_features=n)
features = vectorizer.fit_transform(clean_query)
z = vectorizer.get_feature_names()
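
To make the effect of the custom analyzer concrete, here is a minimal self-contained sketch. The sample documents and the names `base_analyzer`, `custom_analyzer` and `docs` are made up for illustration and are not the data from the question. It reads the fitted vocabulary from `vocabulary_` rather than `get_feature_names()`, which should behave the same across scikit-learn versions.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# Default analyzer producing lowercase unigrams and bigrams
base_analyzer = CountVectorizer(ngram_range=(1, 2)).build_analyzer()

def custom_analyzer(text):
    # Keep only n-grams that are not purely digits and whitespace
    return [g for g in base_analyzer(text) if not re.match(r"^[0-9\s]+$", g)]

# Hypothetical sample documents, not the asker's data
docs = ["order 66 now", "call 911"]

vectorizer = CountVectorizer(analyzer=custom_analyzer)
features = vectorizer.fit_transform(docs)

# The purely numeric tokens "66" and "911" are gone,
# but bigrams such as "order 66" survive the filter
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(features.shape)
```

With a list of plain strings as input, this runs without the error, which suggests the analyzer itself is not the cause.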

The fit_transform call raises the following error:

(<type 'exceptions.NotImplementedError'>, 'python.py', 128,NotImplementedError('adding a nonzero scalar to a sparse matrix is not supported',)) 

The error appears when we call fit_transform on the vectorizer. However, the value of the variable clean_query is not a scalar. I am using sklearn-0.17.1.

np.isscalar(clean_query) 
False 

Please post the data so that we can reproduce the error.

Answer


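Here is a small test in which I tried to reproduce the error, but it did not give me the same error. (The example is taken from the scikit-learn Feature extraction documentation.)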

scikit-learn version : 0.19.dev0

In [1]: corpus = [
   ...:     'This is the first document.',
   ...:     'This is the second second document.',
   ...:     'And the third one.',
   ...:     'Is this the first document?',
   ...: ]

In [2]: from sklearn.feature_extraction.text import TfidfVectorizer 

In [3]: vectorizer = TfidfVectorizer(min_df=1) 

In [4]: vectorizer.fit_transform(corpus) 
Out[4]: 
<4x9 sparse matrix of type '<type 'numpy.float64'>' 
    with 19 stored elements in Compressed Sparse Row format> 

In [5]: import numpy as np 

In [6]: np.isscalar(corpus) 
Out[6]: False 

In [7]: type(corpus) 
Out[7]: list 

As the code above shows, corpus is not a scalar and has type list.
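
It may also help to see where this message comes from. SciPy raises it when a nonzero scalar is added to a sparse matrix, so it concerns an arithmetic operation on a sparse result and not whether the input list is a scalar. The sketch below uses a tiny made-up matrix `X` as a stand-in for the output of fit_transform; the exact wording of the message may differ between SciPy versions.

```python
import numpy as np
from scipy import sparse

# A small sparse matrix standing in for the output of fit_transform
X = sparse.csr_matrix(np.array([[1, 0], [0, 2]]))

try:
    X + 1  # would turn every implicit zero into a stored value
    raised = False
except NotImplementedError as exc:
    raised = True
    print("NotImplementedError:", exc)

# Converting to a dense array first makes the addition well defined
dense_sum = X.toarray() + 1
print(dense_sum)
```

If something similar happens around line 128 of your script, for example adding a constant to `features`, that line would be worth checking. This is only a guess, since the surrounding code is not shown.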

I think the solution lies in how you build the clean_query variable: it has to be in the form that vectorizer.fit_transform expects.
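
As a rough way to check this, the sketch below scans a list for items that are not plain strings. The helper `find_non_string_items` and the list `sample_query` are hypothetical names for illustration; the real clean_query comes from review_to_words, which is not shown in the question. On Python 2 you would probably test against `basestring` instead of `str`.

```python
def find_non_string_items(documents):
    """Return (index, type name) pairs for items that are not text."""
    problems = []
    for i, doc in enumerate(documents):
        if not isinstance(doc, str):
            problems.append((i, type(doc).__name__))
    return problems

# Hypothetical stand-in for clean_query
sample_query = ["red shoes size 10", None, ["blue", "hat"], "green scarf"]

problems = find_non_string_items(sample_query)
print(problems)
```

An empty result means every item is a string, which is the input shape the default vectorizer pipeline is documented to accept.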