2013-12-18

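Is the silhouette coefficient subsampling in sklearn stratified?

I am once again having trouble with the silhouette coefficient in scikit-learn. (My first question is here: silhouette coefficient in python with sklearn.) I ran a clustering that is very unbalanced but contains many individuals, so I want to use the sampling parameter of the silhouette coefficient. I would like to know whether the subsampling is stratified, meaning that the sample is drawn with respect to the clusters. I use the iris dataset as an example, but my real dataset is much larger (which is why I need sampling). My code is: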

import pandas as pd
from sklearn import datasets
from sklearn.metrics import silhouette_score
iris = datasets.load_iris()
col = iris.feature_names
name = iris.target_names
X = pd.DataFrame(iris.data, columns=col)
y = iris.target
s = silhouette_score(X.values, y, metric='euclidean', sample_size=50)

This works. But now if I skew the labels with:

y[0:148] =0 
y[148] = 1 
y[149] = 2 
print y 
s = silhouette_score(X.values, y, metric='euclidean',sample_size=50) 

I get:

ValueError        Traceback (most recent call last) 
<ipython-input-12-68a7fba49c54> in <module>() 
     4 y[149] =2 
     5 print y 
----> 6 s = silhouette_score(X.values, y, metric='euclidean',sample_size=50) 

/usr/local/lib/python2.7/dist-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_score(X, labels, metric, sample_size, random_state, **kwds) 
    82   else: 
    83    X, labels = X[indices], labels[indices] 
---> 84  return np.mean(silhouette_samples(X, labels, metric=metric, **kwds)) 
    85 
    86 

/usr/local/lib/python2.7/dist-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_samples(X, labels, metric, **kwds) 
    146     for i in range(n)]) 
    147  B = np.array([_nearest_cluster_distance(distances[i], labels, i) 
--> 148     for i in range(n)]) 
    149  sil_samples = (B - A)/np.maximum(A, B) 
    150  # nan values are for clusters of size 1, and should be 0 

/usr/local/lib/python2.7/dist-packages/sklearn/metrics/cluster/unsupervised.pyc in _nearest_cluster_distance(distances_row, labels, i) 
    200  label = labels[i] 
    201  b = np.min([np.mean(distances_row[labels == cur_label]) 
--> 202    for cur_label in set(labels) if not cur_label == label]) 
    203  return b 

/usr/lib/python2.7/dist-packages/numpy/core/fromnumeric.pyc in amin(a, axis, out, keepdims) 
    1980   except AttributeError: 
    1981    return _methods._amin(a, axis=axis, 
-> 1982         out=out, keepdims=keepdims) 
    1983   # NOTE: Dropping the keepdims parameter 
    1984   return amin(axis=axis, out=out) 

/usr/lib/python2.7/dist-packages/numpy/core/_methods.pyc in _amin(a, axis, out, keepdims) 
    12 def _amin(a, axis=None, out=None, keepdims=False): 
    13  return um.minimum.reduce(a, axis=axis, 
---> 14        out=out, keepdims=keepdims) 
    15 
    16 def _sum(a, axis=None, dtype=None, out=None, keepdims=False): 

ValueError: zero-size array to reduction operation minimum which has no identity 

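I think this error occurs because the sampling is random rather than stratified, so it does not take the two small clusters into account.

Am I correct?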

Answers

1

I think you are right: the current implementation does not support balanced resampling.
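One possible workaround is to draw the subsample yourself, one cluster at a time, and pass the already-sampled data to `silhouette_score` without `sample_size`. The sketch below is only an illustration under stated assumptions: `stratified_indices` and its `min_per_cluster` parameter are hypothetical names, not part of scikit-learn, and the code is written for Python 3, unlike the Python 2 session in the question.

```python
import random
from collections import defaultdict


def stratified_indices(labels, sample_size, min_per_cluster=1, seed=0):
    """Pick roughly `sample_size` indices, proportionally per cluster.

    Hypothetical helper, not part of scikit-learn. Every cluster keeps at
    least `min_per_cluster` members (or all of them, if it has fewer).
    """
    rng = random.Random(seed)

    # Group the row indices by cluster label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    fraction = sample_size / len(labels)
    chosen = []
    for members in by_label.values():
        # Proportional share, never below the floor or above the cluster size.
        k = max(min_per_cluster, round(fraction * len(members)))
        k = min(k, len(members))
        chosen.extend(rng.sample(members, k))
    return sorted(chosen)


# Labels skewed as in the question: 148 points in cluster 0, one each in 1 and 2.
labels = [0] * 148 + [1, 2]
idx = stratified_indices(labels, sample_size=50)
sampled_labels = [labels[i] for i in idx]
```

Because of the per-cluster floor, the result can be slightly larger than `sample_size` (51 indices here). With the arrays from the question, the indices could then be used as `silhouette_score(X.values[idx], y[idx], metric='euclidean')`. Keep in mind that the score on such a subsample is only an approximation of the full score, and that, according to the comment visible in the traceback, clusters of size 1 contribute a silhouette of 0.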

2

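Yes, you are right. The sampling is not stratified, because the labels are not taken into account when the sample is drawn.

This is how the sample is taken (version 0.14.1):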

indices = random_state.permutation(X.shape[0])[:sample_size] 

where X is the input array of shape [n_samples_a, n_samples_a] or [n_samples_a, n_features].
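To get a feel for how often this bites, here is a rough check, assuming the permutation line above is the only sampling step. Taking the first `sample_size` entries of a random permutation is a uniform draw without replacement, so the chance that the subsample contains only the big cluster can be computed directly. In that case `_nearest_cluster_distance` has no other cluster to compare against, which matches the empty `np.min` in the traceback. The variable names are mine, and `math.comb` needs Python 3.8 or later.

```python
from math import comb

n_samples = 150    # size of the iris dataset
sample_size = 50   # value passed to silhouette_score in the question
n_rare = 2         # points outside the big cluster (indices 148 and 149)

# Probability that none of the rare points lands in the subsample:
# choose all 50 from the 148 big-cluster points, out of all ways to choose 50.
p_only_one_label = comb(n_samples - n_rare, sample_size) / comb(n_samples, sample_size)

print(round(p_only_one_label, 3))  # prints 0.443
```

So with the skewed labels from the question, roughly 44% of the random subsamples contain a single label, which is why the call fails on some runs and not on others.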