Randomly sample a percentage of each cluster

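I am working on a project that aims to exploit the cluster structure of my dataset to improve a supervised active learning classifier for binary classification. I cluster my data, X, with scikit-learn's K-means using the following code: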

k = KMeans(n_clusters=(i+2)).fit(X)
df = pd.DataFrame({'cluster' : k.labels_, 'percentage positive' : y})
a = df.groupby('cluster').apply(lambda cluster:cluster.sum()/cluster.count())

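The two classes are positive (denoted by 1) and negative (denoted by 0), and they are stored in an array y. This code first clusters X, then stores, for each cluster, its label and the fraction of positive instances it contains in a DataFrame.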
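As a small illustration of the per-cluster summary described above, the following sketch computes each cluster's size and fraction of positives. The `labels` and `y` values are made-up stand-ins for `k.labels_` and the real targets, and the column names are assumptions chosen for readability.

```python
import pandas as pd

# Hypothetical cluster labels and binary targets (1 = positive, 0 = negative).
labels = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y = [0, 1, 0, 0, 0, 1, 1, 0, 1]

df = pd.DataFrame({"cluster": labels, "positive": y})

# The mean of a 0/1 column is the fraction of positives in the cluster;
# "size" counts the number of points assigned to the cluster.
summary = df.groupby("cluster")["positive"].agg(
    size="size", positive_fraction="mean"
)
print(summary)
```

Taking the mean of the 0/1 column gives the same ratio as dividing the sum by the count, without also averaging the cluster label itself.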

I now want to randomly select points from each cluster until I have sampled 15% of that cluster. How can I do this?

As requested, here is a simplified script that includes a test dataset:

from sklearn.cluster import KMeans
import pandas as pd
X = [[1,2], [2,5], [1,2], [3,3], [1,2], [7,3], [1,1], [2,19], [1,11], [54,3], [78,2], [74,36]]
y = [0,0,0,0,0,0,0,0,0,1,0,0]
# Cluster X into 4 groups
k = KMeans(n_clusters=4).fit(X)
# One row per point: its cluster label and its class (1 = positive)
df = pd.DataFrame({'cluster' : k.labels_, 'percentage positive' : y})
# Per cluster: sum / count, i.e. the fraction of positive instances
a = df.groupby('cluster').apply(lambda cluster:cluster.sum()/cluster.count())
print(a)

Note: the real dataset is much larger, with thousands of features and tens of thousands of data instances.

In response to @SandipanDey:

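I can't tell you too much, but basically we are dealing with a highly imbalanced dataset (1:10,000), and we are only interested in identifying the minority class instances with recall > 95%, while reducing the number of labels required. (The recall requirement is because this is healthcare related.)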

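The minority examples cluster together, and any cluster that contains positive instances usually contains at least x% of them, so by sampling x% of each cluster we expect to identify every cluster that contains any positive instances. This lets us quickly reduce the dataset to the clusters that may contain positives. The reduced dataset can then be used for active learning. Our approach is loosely inspired by 'Hierarchical Sampling for Active Learning'.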
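A minimal sketch of the screening idea described above, under stated assumptions: the toy labels, the helper name `clusters_to_keep`, and the choice to round the sample size up to at least one point are all hypothetical, and "labelling" the sampled points is simulated by reading them from `y`. It is not the procedure from the cited paper.

```python
import numpy as np

def clusters_to_keep(labels, y, frac, rng):
    """Sample `frac` of each cluster; keep clusters whose sample has a positive."""
    keep = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)  # indices of points in cluster c
        # Round up and take at least one point (an assumption of this sketch).
        n_sample = max(1, int(np.ceil(frac * len(members))))
        sample = rng.choice(members, size=n_sample, replace=False)
        # Querying labels for the sample is simulated by looking them up in y.
        if y[sample].any():
            keep.append(int(c))
    return keep

rng = np.random.default_rng(0)

# Toy data: cluster 0 has 80 negatives; cluster 1 has 18 positives and 2
# negatives, so any sample of 3 or more points from it contains a positive.
labels = np.array([0] * 80 + [1] * 20)
y = np.array([0] * 80 + [1] * 18 + [0] * 2)

kept = clusters_to_keep(labels, y, frac=0.15, rng=rng)
reduced_idx = np.flatnonzero(np.isin(labels, kept))  # points in kept clusters
print(kept, len(reduced_idx))
```

On real data a cluster with few positives can still be missed by chance, so the sampling fraction trades labelling cost against recall.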


2017-03-12 scutnex

I don't think you meant for negative to be represented by 1 as well. Anyway, could you post a (small) example dataset to do this with? – Denziloe

@Denziloe Good catch, edited. I will add a small test dataset shortly. – scutnex

@Denziloe Added the test dataset. – scutnex

If I understand correctly, the following code should serve the purpose:

import numpy as np

# For each cluster:
# (1) Find all the points from X that are assigned to the cluster.
# (2) Choose x% of those points at random.

n_clusters = 4
x = 0.15  # fraction of each cluster to sample (15%)

for i in range(n_clusters):

    # (1) indices of all the points from X that belong to cluster i
    C_i = np.where(k.labels_ == i)[0].tolist()
    n_i = len(C_i)  # number of points in cluster i

    # (2) indices of the points from X to be sampled from cluster i.
    # replace=False ensures no point is drawn twice. int() rounds down,
    # so clusters with fewer than 1/x points yield an empty sample.
    sample_i = np.random.choice(C_i, int(x * n_i), replace=False)
    print(i, sample_i)
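As a variation on the loop above, the same per-cluster sampling can be sketched with pandas. This assumes pandas 1.1 or later, where `GroupBy.sample` is available; the `labels` array is a hypothetical stand-in for `k.labels_`, and the `random_state` value is only there to make the draw repeatable.

```python
import numpy as np
import pandas as pd

# Hypothetical cluster labels standing in for k.labels_ (not real output).
labels = np.array([0] * 20 + [1] * 40 + [2] * 100)
df = pd.DataFrame({"cluster": labels})

# Draw 15% of the rows of every cluster, without replacement.
# The per-cluster sample size is frac * cluster size, rounded, so very
# small clusters can end up with no sampled rows.
sampled = df.groupby("cluster").sample(frac=0.15, random_state=0)

# Row indices of the sampled points, keyed by cluster label.
sample_idx = {c: g.index.to_numpy() for c, g in sampled.groupby("cluster")}

# Number of sampled rows per cluster.
counts = sampled["cluster"].value_counts().sort_index()
print(counts)
```

The index of `sampled` refers back to rows of the original data, so it can be used to select the corresponding rows of X.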

Just out of curiosity, how do you plan to use these x% of points for active learning?


2017-03-12 19:30:33

Thanks! I have added a brief description of the approach. – scutnex

Thanks a lot @scutnex for adding the description, much appreciated. –
