2017-06-29 42 views
2

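Getting the index of a DataFrame after a NumPy array function

I have a function that implements the k-means algorithm, and I would like to use it with DataFrames so that the index is taken into account. Currently I pass DataFrame.values, which works, but I do not get the index in the output.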

import random

import numpy as np

def cluster_points(X, mu):
    # Assign each point to the cluster of its nearest center
    clusters = {}
    for x in X:
        bestmukey = min([(i[0], np.linalg.norm(x - mu[i[0]]))
                         for i in enumerate(mu)], key=lambda t: t[1])[0]
        try:
            clusters[bestmukey].append(x)
        except KeyError:
            clusters[bestmukey] = [x]
    return clusters

def reevaluate_centers(mu, clusters):
    newmu = []
    keys = sorted(clusters.keys())
    for k in keys:
        newmu.append(np.mean(clusters[k], axis=0))
    return newmu

def has_converged(mu, oldmu): 
    return (set([tuple(a) for a in mu]) == set([tuple(a) for a in oldmu])) 


def find_centers(X, K):
    # Initialize to K random centers
    # (list(X) gives random.sample a sequence of rows, which Python 3 requires)
    oldmu = random.sample(list(X), K)
    mu = random.sample(list(X), K)
    while not has_converged(mu, oldmu):
        oldmu = mu
        # Assign all points in X to clusters
        clusters = cluster_points(X, mu)
        # Reevaluate centers
        mu = reevaluate_centers(oldmu, clusters)
    return (mu, clusters)

For example, with this minimal, sufficient example:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,10,size=(10, 5)), index = list(range(10)), columns=list(range(5))) 
df.index.name = 'subscriber_id' 
df.columns.name = 'ad_id' 

I get:

find_centers(df.values, 2) 
([array([ 3.8, 3. , 3.6, 2. , 3.6]), 
    array([ 6.8, 3.6, 5.6, 6.8, 6.8])], 
{0: [array([2, 0, 5, 6, 4]), 
    array([1, 1, 2, 3, 3]), 
    array([6, 0, 4, 0, 3]), 
    array([7, 9, 4, 1, 7]), 
    array([3, 5, 3, 0, 1])], 
    1: [array([6, 2, 5, 9, 6]), 
    array([8, 9, 7, 2, 8]), 
    array([7, 5, 3, 7, 8]), 
    array([7, 1, 5, 7, 6]), 
    array([6, 1, 8, 9, 6])]}) 

I get the values, but not the index.

Answer

1

If you want to get the array of values including the index, you can simply turn the index into a column with reset_index():

values_with_index = df.reset_index().values 
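As a rough illustration of what reset_index() produces here, the sketch below uses a small made-up frame (the data and index labels are assumptions, not the asker's data). It shows that the index labels end up in the first column of the array, so they would need to be split off again before computing distances; otherwise they would be treated as one more feature.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the values and labels are made up for this sketch.
df = pd.DataFrame(
    [[2, 0, 5], [1, 1, 2], [6, 2, 5]],
    index=pd.Index([10, 11, 12], name="subscriber_id"),
    columns=pd.Index([0, 1, 2], name="ad_id"),
)

# reset_index() moves the index into a regular column at position 0.
values_with_index = df.reset_index().values

# Split the array back into index labels and the original data columns.
index_labels = values_with_index[:, 0]
features = values_with_index[:, 1:]

print(index_labels)
print(features)
```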

Update

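If what you want is to have the index on the output without using it during the actual clustering, you can do the following. First, pass the actual DataFrame object to find_centers: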

find_centers(df, 2) 

Then change cluster_points as follows:

def cluster_points(X, mu):
    clusters = {}
    for _, x in X.iterrows():
        bestmukey = min([(i[0], np.linalg.norm(x - mu[i[0]]))
                         for i in enumerate(mu)], key=lambda t: t[1])[0]
        # You can replace this try/except block with
        # clusters.setdefault(bestmukey, []).append(x)
        try:
            clusters[bestmukey].append(x)
        except KeyError:
            clusters[bestmukey] = [x]
    return clusters

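The output will still contain arrays for the centers, but the clusters will contain one Series object per row. The name attribute of each of these Series is the corresponding index value in the DataFrame.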
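A minimal end-to-end sketch of this idea follows. It is not the original poster's exact code: it assumes the initial centers are drawn from the rows of X.values (random.sample expects a sequence, and a DataFrame is not a sequence of rows), it computes the clusters once before the loop so they are defined even if the loop body never runs, and index_labels is a hypothetical helper added only to show how Series.name recovers the index.

```python
import random

import numpy as np
import pandas as pd


def cluster_points(X, mu):
    """Assign each DataFrame row (a Series) to its nearest center."""
    clusters = {}
    for _, x in X.iterrows():
        distances = [np.linalg.norm(x.values - center) for center in mu]
        bestmukey = int(np.argmin(distances))
        clusters.setdefault(bestmukey, []).append(x)
    return clusters


def reevaluate_centers(clusters):
    # Mean of the rows in each cluster, in sorted key order.
    return [np.mean([row.values for row in clusters[k]], axis=0)
            for k in sorted(clusters)]


def has_converged(mu, oldmu):
    return set(tuple(a) for a in mu) == set(tuple(a) for a in oldmu)


def find_centers(X, K):
    # Assumption: sample the initial centers from the rows of X.values,
    # because random.sample needs a sequence of rows, not a DataFrame.
    rows = list(X.values.astype(float))
    oldmu = random.sample(rows, K)
    mu = random.sample(rows, K)
    clusters = cluster_points(X, mu)
    while not has_converged(mu, oldmu):
        oldmu = mu
        clusters = cluster_points(X, mu)
        mu = reevaluate_centers(clusters)
    return mu, clusters


def index_labels(clusters):
    # Hypothetical helper: keep only the index label (Series.name) of each row.
    return {k: [row.name for row in rows] for k, rows in clusters.items()}


# Illustrative data (made up): two well-separated groups of points.
df = pd.DataFrame(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]],
    index=pd.Index(list("abcdef"), name="subscriber_id"),
)
mu, clusters = find_centers(df, 2)
labels = index_labels(clusters)
print(labels)
```

Because every clustered row is still a Series, labels maps each cluster key to the index values of its members, while the centers in mu remain plain NumPy arrays.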

+0

The OP probably means applying his function 'find_centers' – Marine1

+0

@Marine1 You may be right. I was confused by the "to take the index into account" part, but that makes more sense... I have updated the answer. – jdehesa