2017-02-10 107 views
2

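Creating unique identifiers for groups

I am working on a problem where I have to group related items and assign each group a unique identifier. I wrote the code in Python, but it does not return the expected output. I need help refining my logic. The code is below: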

data = {} 
child_list = [] 


for index, row in df.iterrows():
    parent = row['source']
    child = row['target']
    #print 'Parent: ', parent
    #print 'Child:', child
    child_list.append(child)
    #print child_list
    if parent not in data.keys():
        data[parent] = []
    if parent != child:
        data[parent].append(child)
    #print data

op = {} 
gid = 0 


def recursive(op,x,gid):
    if x in data.keys() and data[x] != []:
        for x_child in data[x]:
            if x_child in data.keys():
                op[x_child] = gid
                recursive(op,x_child,gid)
            else:
                op[x] = gid
    else:
        op[x] = gid


for key in data.keys():
    #print "Key: ", key
    if key not in child_list:
        gid = gid + 1
        op[key] = gid
        for x in data[key]:
            op[x] = gid
            recursive(op,x,gid)

related = pd.DataFrame({'items': op.keys(),
                        'uniq_group_id': op.values()})
related.sort_values('items')

Example:

Input: 
source target 
a  b 
b  c 
c  c 
c  d 
d  d 
e  f 
a  d 
h  a 
i  f 

Desired Output: 
item  uniq_group_id 
a   1 
b   1 
c   1 
d   1 
h   1 
e   2 
f   2 
i   2 

My code gives me the following output, which is wrong:

item uniq_group_id 
a  3 
b  3 
c  3 
d  3 
e  1 
f  2 
h  3 
i  2 

Another example:

Input: 
df = pd.DataFrame({'source': ['a','b','c','c','d','e','a','h','i','a'], 
       'target':['b','c','c','d','d','f','d','a','f','a']}) 
Desired Output: 
item uniq_group_id 
a  1 
b  1 
c  1 
d  1 
e  2 
f  2 

My code Output: 
item uniq_group_id 
e 1 
f 1 

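The order of the rows and of the group IDs does not matter. What matters is that related items are assigned the same unique identifier. The whole problem is to find the groups of related items and assign each group a unique group ID.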

Answers

1

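I haven't analyzed your code closely, but the error appears to come from the way you populate the data dictionary. It stores each child node as a neighbor of its parent, but it also needs to store the parent as a neighbor of the child.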
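To make that concrete, here is a minimal sketch using the edges from the question's first example. The names edges, directed, and undirected are illustrative and do not appear in the original code. With parent-to-child links only, f is reachable from both e and i, but e and i cannot reach each other, so a traversal started separately from each of them can label them as different groups. Storing every link in both directions removes that asymmetry.

```python
from collections import defaultdict

# Edges from the question's first example, as (source, target) pairs.
edges = [('a', 'b'), ('b', 'c'), ('c', 'c'), ('c', 'd'), ('d', 'd'),
         ('e', 'f'), ('a', 'd'), ('h', 'a'), ('i', 'f')]

directed = defaultdict(set)    # parent -> children only, as in the question
undirected = defaultdict(set)  # every link stored in both directions

for parent, child in edges:
    directed[parent].add(child)
    undirected[parent].add(child)
    undirected[child].add(parent)

print(sorted(directed['f']))    # [] : f has no link back to e or i
print(sorted(undirected['f']))  # ['e', 'i']
```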

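Rather than trying to fix your code, I decided to adapt this pseudocode written by Aseem Goyal. The code below takes its input data from simple Python lists, but it should be easy to adapt it to a Pandas DataFrame.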

''' Find all the connected components of an undirected graph ''' 

from collections import defaultdict 

src = ['a', 'b', 'c', 'c', 'd', 'e', 'a', 'h', 'i', 'a'] 
tgt = ['b', 'c', 'c', 'd', 'd', 'f', 'd', 'a', 'f', 'a'] 

nodes = sorted(set(src + tgt)) 
print('Nodes', nodes) 

neighbors = defaultdict(set) 
for u, v in zip(src, tgt): 
    neighbors[u].add(v) 
    neighbors[v].add(u) 

print('Neighbors') 
for n in nodes: 
    print(n, neighbors[n]) 

visited = {}
def depth_first_traverse(node, group_id):
    for n in neighbors[node]:
        if n not in visited:
            visited[n] = group_id
            depth_first_traverse(n, group_id)

print('Groups')
group_id = 1
for n in nodes:
    if n not in visited:
        visited[n] = group_id
        depth_first_traverse(n, group_id)
        group_id += 1
    print(n, visited[n])

Output

Nodes ['a', 'b', 'c', 'd', 'e', 'f', 'h', 'i'] 
Neighbors 
a {'a', 'd', 'b', 'h'} 
b {'a', 'c'} 
c {'d', 'b', 'c'} 
d {'d', 'a', 'c'} 
e {'f'} 
f {'i', 'e'} 
h {'a'} 
i {'f'} 
Groups 
a 1 
b 1 
c 1 
d 1 
e 2 
f 2 
h 1 
i 2 

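This code was written for Python 3, but it will also run on Python 2. If you run it on Python 2, you should add from __future__ import print_function at the top of your import statements; it is not strictly necessary, but it makes the output look nicer.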
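For reference, one possible way to adapt the approach to a Pandas DataFrame is sketched below. It is an assumption-laden variant rather than the code above: the function name label_components and the variable group_of are made up for this illustration, the column names items and uniq_group_id are taken from the question, and the recursion is replaced by an explicit stack so that long chains do not hit Python's recursion limit.

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({'source': ['a', 'b', 'c', 'c', 'd', 'e', 'a', 'h', 'i', 'a'],
                   'target': ['b', 'c', 'c', 'd', 'd', 'f', 'd', 'a', 'f', 'a']})

# Build an undirected adjacency map straight from the two columns.
neighbors = defaultdict(set)
for u, v in zip(df['source'], df['target']):
    neighbors[u].add(v)
    neighbors[v].add(u)


def label_components(neighbors):
    """Return {node: group_id}, walking each component with an explicit stack."""
    group_of = {}
    group_id = 0
    for start in sorted(neighbors):
        if start in group_of:
            continue  # already labelled as part of an earlier component
        group_id += 1
        group_of[start] = group_id
        stack = [start]
        while stack:
            node = stack.pop()
            for n in neighbors[node]:
                if n not in group_of:
                    group_of[n] = group_id
                    stack.append(n)
    return group_of


group_of = label_components(neighbors)
related = pd.DataFrame({'items': list(group_of.keys()),
                        'uniq_group_id': list(group_of.values())})
related = related.sort_values('items').reset_index(drop=True)
print(related)
```

Because the start nodes are visited in sorted order, the group containing a is numbered 1 and the group containing e is numbered 2 for this input. The question states that the numbering itself does not matter.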

+0

Thanks. This logic works fine for my use case. – Sam

1

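You can use this Union-Find, or Disjoint-Sets algorithm. For a more complete explanation, see this related answer. Basically, you need two functions, union and find, which build a tree (i.e. a nested dictionary) of leaders, or predecessors: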

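import collections

# Maps each item to its leader; None means the item is its own leader.
leaders = collections.defaultdict(lambda: None)

def find(x):
    l = leaders[x]
    if l is not None:
        l = find(l)
        leaders[x] = l  # path compression: point x directly at the root
        return l
    return x

def union(x, y):
    lx, ly = find(x), find(y)
    if lx != ly:
        leaders[lx] = ly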

You can apply this to your problem as follows:

import pandas as pd

df = pd.DataFrame({'source': ['a','b','c','c','d','e','a','h','i','a'],
                   'target': ['b','c','c','d','d','f','d','a','f','a']})

# build the tree
for _, row in df.iterrows():
    union(row["source"], row["target"])

# build groups based on leaders
groups = collections.defaultdict(set)
for x in leaders:
    groups[find(x)].add(x)
for num, group in enumerate(groups.values(), start=1):
    print(num, group)

Result:

1 {'e', 'f', 'i'} 
2 {'h', 'a', 'c', 'd', 'b'} 
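
If you want a per-item mapping like the one in the question's desired output rather than printed sets, a self-contained variant could look like the sketch below. This is not the code above: it passes the leaders dictionary explicitly, uses an iterative find instead of recursion, and the helper name group_ids is made up for this illustration.

```python
def find(leaders, x):
    """Return the representative of x's group, compressing the path."""
    leaders.setdefault(x, x)  # an unseen item starts as its own leader
    root = x
    while leaders[root] != root:
        root = leaders[root]
    # Second pass: point every node on the path directly at the root.
    while leaders[x] != root:
        leaders[x], x = root, leaders[x]
    return root


def union(leaders, x, y):
    """Merge the groups containing x and y."""
    rx, ry = find(leaders, x), find(leaders, y)
    if rx != ry:
        leaders[rx] = ry


def group_ids(pairs):
    """Map every item in pairs to a group number starting at 1."""
    leaders = {}
    for source, target in pairs:
        union(leaders, source, target)
    ids = {}     # root -> group number, in order of first appearance
    result = {}  # item -> group number
    for item in sorted(leaders):
        root = find(leaders, item)
        if root not in ids:
            ids[root] = len(ids) + 1
        result[item] = ids[root]
    return result


src = ['a', 'b', 'c', 'c', 'd', 'e', 'a', 'h', 'i', 'a']
tgt = ['b', 'c', 'c', 'd', 'd', 'f', 'd', 'a', 'f', 'a']
for item, gid in group_ids(zip(src, tgt)).items():
    print(item, gid)
```

The resulting dictionary can be turned into a two-column DataFrame in the same way as the op dictionary in the question.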
+0

Thank you for your solution. This works too. – Sam