Determine a column value based on two other columns

I have two columns, Label1 and Label2. Both are cluster labels produced by different methods.

   Label1  Label2
0       0    1024
1       1    1024
2       2    1025
3       3    1026
4       3    1027
5       4    1028

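I want to derive a final cluster label from these two columns. Compare the rows: whenever two rows share either of the two labels, they belong to the same cluster.

For example, rows 0 and 1 share a Label2 value, and rows 3 and 4 share a Label1 value, so rows 0 and 1 are in one group and rows 3 and 4 are in another. The result I want is: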

   Label1  Label2  Cluster ID
0       0    1024           0
1       1    1024           0
2       2    1025           1
3       3    1026           2
4       3    1027           2
5       4    1028           3

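What is the best way to do this? Any help would be greatly appreciated.

Edit: I don't think I gave a good example. Note that the labels are not necessarily in any order: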

   Label1  Label2
0       0    1024
1       1    1023
2       2    1025
3       3    1024
4       3    1027
5       4    1022

Source

2016-08-26 aidsj

Could you post the code for your best attempt? Thanks – lrnzcig

Please see this link for more help - http://stackoverflow.com/help/how-to-ask –

Try this, using np.where and pandas duplicated:

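import numpy as np
import pandas as pd

df = pd.DataFrame({'Label1': [0, 1, 2, 3, 3, 4], 'Label2': [1024, 1024, 1025, 1026, 1027, 1028]})
df = df.sort_values(['Label1', 'Label2'])
# A row starts a new cluster only if neither of its labels has been seen before
df['Cluster'] = np.where(df.Label1.duplicated() | df.Label2.duplicated(), 0, 1).cumsum()
print(df)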

   Label1  Label2  Cluster
0       0    1024        1
1       1    1024        1
2       2    1025        2
3       3    1026        3
4       3    1027        3
5       4    1028        4

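Source

2016-08-26 15:33:11 Merlin

Thanks. I have updated the question: my problem is that the label columns are not monotonic. The same label can appear in row 1 and then again in row 100, and those rows should be clustered into the same group. Any suggestions? – aidsj

I tested it, and it does depend on the order. Given df = pd.DataFrame({'Label1': [0,1,2,2,1,3], 'Label2': [1023,1024,1025,1026,1027,1028]}), the result is 1,2,3,3,3,4. However, rows 1 and 4 should be in the same group. Thanks for your help anyway. – aidsj

Yes, you are right. I fixed it; the cumsum approach depends on row order. Good luck – Merlin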
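The order-dependence discussed in these comments can be tested mechanically. The following is a minimal sketch, not part of any answer in this thread: the helper name `respects_shared_labels` is hypothetical, and it checks only the necessary condition from the question (rows that share a Label1 or Label2 value must share a cluster ID), not that unrelated rows are kept apart.

```python
import pandas as pd

def respects_shared_labels(df, cluster_col, label_cols=("Label1", "Label2")):
    """Return True if rows sharing a value in any label column share one cluster ID."""
    return all(
        # For each label value, count the distinct cluster IDs assigned to it.
        df.groupby(col)[cluster_col].nunique().le(1).all()
        for col in label_cols
    )

# Counter-example from the comment above, with the result reported there.
df = pd.DataFrame({
    "Label1": [0, 1, 2, 2, 1, 3],
    "Label2": [1023, 1024, 1025, 1026, 1027, 1028],
})
df["Cluster"] = [1, 2, 3, 3, 3, 4]
ok = respects_shared_labels(df, "Cluster")
print(ok)
```

Here rows 1 and 4 share Label1 = 1 but carry different cluster IDs, so the check fails for the reported result.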

I'm not sure I understood the question correctly, but here is one possible way to identify the clusters:

import pandas as pd
import collections

df = pd.DataFrame(
    {'Label1': [0, 1, 2, 3, 3, 4], 'Label2': [1024, 1024, 1025, 1026, 1027, 1028]})
df['Cluster ID'] = [0] * 6

# Label values that occur more than once in each column
# (lists, so that .index() below works on Python 3)
counter1 = [k for k, v in collections.Counter(df['Label1']).items() if v > 1]
counter2 = [k for k, v in collections.Counter(df['Label2']).items() if v > 1]

len1 = len(counter1)
len2 = len(counter2)
index_cluster = len1 + len2  # first ID available for rows with no repeated label

for index, row in df.iterrows():
    if row['Label2'] in counter2:
        df.loc[index, 'Cluster ID'] = counter2.index(row['Label2'])
    elif row['Label1'] in counter1:
        df.loc[index, 'Cluster ID'] = counter1.index(row['Label1']) + len2
    else:
        df.loc[index, 'Cluster ID'] = index_cluster
        index_cluster += 1

print(df)

Source

2016-08-26 11:10:41 BPL

Thanks. Could you elaborate a bit? – aidsj

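Here is how you can achieve this:

Compare both columns with the values in the previous row.
If either value is the same, do not increment the cluster number, and append it to the cluster list.
If neither value is the same, increment the cluster number and append it to the cluster list.
Add the cluster list to the DataFrame as a column.

Code: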

import pandas as pd

df = pd.DataFrame([[0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 3, 4], [1024, 1024, 1025, 1026, 1027, 1028]]).T
cluster_num = 0
cluster_list = []
for i, row in df.iterrows():
    if i != 0:
        # check previous row
        if df.loc[i - 1][1] == row[1] or df.loc[i - 1][2] == row[2]:
            # add to previous cluster
            cluster_list.append(cluster_num)
        else:
            # create new cluster
            cluster_num += 1
            cluster_list.append(cluster_num)
    else:
        cluster_list.append(cluster_num)

# Add the list as column
df.insert(3, 3, cluster_list)
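The same previous-row comparison can probably be written without an explicit loop. This is a sketch rather than the answerer's code; it assumes the columns are named Label1 and Label2 as in the question, and it shares the loop's limitation that only adjacent rows are compared.

```python
import pandas as pd

df = pd.DataFrame({
    "Label1": [0, 1, 2, 3, 3, 4],
    "Label2": [1024, 1024, 1025, 1026, 1027, 1028],
})

# A row starts a new cluster when neither label matches the previous row.
labels = df[["Label1", "Label2"]]
new_cluster = (labels != labels.shift()).all(axis=1)
new_cluster.iloc[0] = False  # the first row opens cluster 0 rather than incrementing

# The running count of "new cluster" flags is the cluster number.
df["Cluster ID"] = new_cluster.cumsum()
print(df)
```

For the ordered example in the question this gives cluster IDs 0, 0, 1, 2, 2, 3.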

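Source

2016-08-26 11:30:25

IIUC, you can group them as follows:

Take the difference between each row and the row before it, fill the top row with 0, and compute the cumulative sum for each of the labels [1 and 2].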

In [2]: label1_ = df['Label1'].diff().fillna(0).cumsum() 

In [3]: label2_ = df['Label2'].diff().fillna(0).cumsum()

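Concatenate these into a new DataFrame and drop duplicate values for each of the two labels [1, 2] separately. Then call reset_index to get back the default integer index.

In [4]: df_ = (pd.concat([label1_, label2_], axis=1)
   ...:          .drop_duplicates(['Label1'])
   ...:          .drop_duplicates(['Label2'])
   ...:          .reset_index())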

Assign the index values to a new column, the cluster ID.

In [5]: df_['Cluster_ID'] = df_.index 

In [6]: df_.set_index('index', inplace=True) 

In [7]: df['Cluster_ID'] = df_['Cluster_ID']

Replace the NaN values with the previous finite value, and cast the final answer to integers.

In [8]: df.ffill().astype(int)
Out[8]:
   Label1  Label2  Cluster_ID
0       0    1024           0
1       1    1024           0
2       2    1025           1
3       3    1026           2
4       3    1027           2
5       4    1028           3

Source

2016-08-26 12:17:45
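For the unordered case raised in the question's edit, one option is to treat the rows as nodes and merge any two rows that share a label, for example with a union-find structure. The sketch below is not taken from any answer in this thread. The function name `cluster_ids` is hypothetical, and it assumes that cluster membership is transitive: if row A shares a label with row B, and B shares one with C, all three end up together.

```python
def cluster_ids(label1, label2):
    """Assign one cluster ID per row; rows sharing a value in either
    sequence (directly or through a chain of rows) get the same ID."""
    n = len(label1)
    parent = list(range(n))  # union-find forest over row positions

    def find(i):
        # Follow parent links to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge each row with the first row seen carrying the same label value.
    for labels in (label1, label2):
        first_seen = {}
        for i, value in enumerate(labels):
            if value in first_seen:
                parent[find(i)] = find(first_seen[value])
            else:
                first_seen[value] = i

    # Renumber the roots 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


# Unordered example from the question's edit.
result = cluster_ids([0, 1, 2, 3, 3, 4], [1024, 1023, 1025, 1024, 1027, 1022])
print(result)  # [0, 1, 2, 0, 0, 3]
```

Rows 0 and 3 share Label2 = 1024 and rows 3 and 4 share Label1 = 3, so rows 0, 3 and 4 form one cluster regardless of row order. With a DataFrame, the result could be attached with df['Cluster ID'] = cluster_ids(df['Label1'], df['Label2']).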
