
I'm trying to use scikit-learn's DBSCAN implementation to cluster a bunch of documents. First, I create a TF-IDF matrix using scikit-learn's TfidfVectorizer (it's a 163405x13029 sparse matrix of type numpy.float64). Then I try to cluster specific subsets of this matrix. When the subset is small (say, a few thousand rows), everything works fine. But for large subsets (tens of thousands of rows), I get ValueError: could not convert integer scalar.
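
Roughly, the setup looks like this (a minimal sketch; documents stands in for my actual corpus and idxs for the index list):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

# documents is a placeholder for the real corpus
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)  # 163405x13029 sparse matrix

ncm_clusterizer = DBSCAN()
ncm_clusterizer.fit_predict(tfidf[idxs])  # fine for a few thousand rows, fails for ~100k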

Here's the full traceback (idxs is a list of indices):


ValueError      Traceback (most recent call last) 
<ipython-input-1-73ee366d8de5> in <module>() 
    193  # use descriptions to clusterize items 
    194  ncm_clusterizer = DBSCAN() 
--> 195  ncm_clusterizer.fit_predict(tfidf[idxs]) 
    196  idxs_clusters = list(zip(idxs, ncm_clusterizer.labels_)) 
    197  for e in idxs_clusters: 

/usr/local/lib/python3.4/site-packages/sklearn/cluster/dbscan_.py in fit_predict(self, X, y, sample_weight) 
    294    cluster labels 
    295   """ 
--> 296   self.fit(X, sample_weight=sample_weight) 
    297   return self.labels_ 

/usr/local/lib/python3.4/site-packages/sklearn/cluster/dbscan_.py in fit(self, X, y, sample_weight) 
    264   X = check_array(X, accept_sparse='csr') 
    265   clust = dbscan(X, sample_weight=sample_weight, 
--> 266      **self.get_params()) 
    267   self.core_sample_indices_, self.labels_ = clust 
    268   if len(self.core_sample_indices_): 

/usr/local/lib/python3.4/site-packages/sklearn/cluster/dbscan_.py in dbscan(X, eps, min_samples, metric, algorithm, leaf_size, p, sample_weight, n_jobs) 
    136   # This has worst case O(n^2) memory complexity 
    137   neighborhoods = neighbors_model.radius_neighbors(X, eps, 
--> 138               return_distance=False) 
    139 
    140  if sample_weight is None: 

/usr/local/lib/python3.4/site-packages/sklearn/neighbors/base.py in radius_neighbors(self, X, radius, return_distance) 
    584    if self.effective_metric_ == 'euclidean': 
    585     dist = pairwise_distances(X, self._fit_X, 'euclidean', 
--> 586           n_jobs=self.n_jobs, squared=True) 
    587     radius *= radius 
    588    else: 

/usr/local/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds) 
    1238   func = partial(distance.cdist, metric=metric, **kwds) 
    1239 
--> 1240  return _parallel_pairwise(X, Y, func, n_jobs, **kwds) 
    1241 
    1242 

/usr/local/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds) 
    1081  if n_jobs == 1: 
    1082   # Special case to avoid picklability checks in delayed 
--> 1083   return func(X, Y, **kwds) 
    1084 
    1085  # TODO: in some cases, backend='threading' may be appropriate 

/usr/local/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in euclidean_distances(X, Y, Y_norm_squared, squared, X_norm_squared) 
    243   YY = row_norms(Y, squared=True)[np.newaxis, :] 
    244 
--> 245  distances = safe_sparse_dot(X, Y.T, dense_output=True) 
    246  distances *= -2 
    247  distances += XX 

/usr/local/lib/python3.4/site-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output) 
    184   ret = a * b 
    185   if dense_output and hasattr(ret, "toarray"): 
--> 186    ret = ret.toarray() 
    187   return ret 
    188  else: 

/usr/local/lib/python3.4/site-packages/scipy/sparse/compressed.py in toarray(self, order, out) 
    918  def toarray(self, order=None, out=None): 
    919   """See the docstring for `spmatrix.toarray`.""" 
--> 920   return self.tocoo(copy=False).toarray(order=order, out=out) 
    921 
    922  ############################################################## 

/usr/local/lib/python3.4/site-packages/scipy/sparse/coo.py in toarray(self, order, out) 
    256   M,N = self.shape 
    257   coo_todense(M, N, self.nnz, self.row, self.col, self.data, 
--> 258      B.ravel('A'), fortran) 
    259   return B 
    260 

ValueError: could not convert integer scalar 

I'm using Python 3.4.3 (on Red Hat), SciPy 0.18.1, and scikit-learn 0.18.1.

I've tried the monkey patch suggested here, but it didn't work.

Googling around, I found a bugfix that apparently solved the same problem for other sparse matrix types (such as csr), but not for coo.

I've tried feeding DBSCAN a sparse radius-neighborhood graph (instead of a feature matrix), as suggested here, but the same error occurs.
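
That attempt looked roughly like this (a sketch; the radius/eps value of 0.5 is just a placeholder):

from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

# Build a sparse radius-neighbors graph, then cluster it with
# metric='precomputed'; this attempt still raised the same ValueError
nn = NearestNeighbors(radius=0.5).fit(tfidf[idxs])
graph = nn.radius_neighbors_graph(tfidf[idxs], mode='distance')
DBSCAN(eps=0.5, metric='precomputed').fit_predict(graph)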

I've tried HDBSCAN, but the same error occurs.
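
The HDBSCAN attempt was along these lines (a sketch using the hdbscan package; min_cluster_size is just the default):

import hdbscan

# Cluster the same sparse TF-IDF subset; this also raised the same error
hdb = hdbscan.HDBSCAN(min_cluster_size=5)
hdb.fit_predict(tfidf[idxs])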

How can I fix this, or work around it?


What is 'idxs' in 'fit_predict(tfidf[idxs])'? Are you only using some of the values of tfidf? –


'idxs' is a list of indices (and yes, I'm only using some of the values of tfidf - it has ~163k documents in total, but I'm only using ~107k) – Parzival


Have you tried updating your scipy and scikit-learn versions? –

Answer


Even if the implementation allowed it, DBSCAN would probably produce bad results on such very high-dimensional data (statistically speaking, because of the curse of dimensionality).

Instead, I'd suggest using the TruncatedSVD class to reduce the dimensionality of your TF-IDF feature vectors down to 50 or 100 components, and then applying DBSCAN on the results.
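
For example, something along these lines (a rough sketch; n_components and the DBSCAN parameters are starting points you'd need to tune for your data):

from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import DBSCAN

# Project the sparse TF-IDF matrix onto 100 dense LSA components;
# fit_transform accepts sparse input and returns a dense (n_samples, 100) array
svd = TruncatedSVD(n_components=100)
reduced = svd.fit_transform(tfidf[idxs])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(reduced)

TruncatedSVD works directly on sparse input, so the full 13029-dimensional matrix never has to be converted to dense, which is where the error is being raised.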
