2014-04-07 198 views
0

Slicing a pandas DataFrame based on lists of IDs in another pandas DataFrame

I have created the following pandas DataFrame:

In [8]: df = pd.DataFrame({'groups': [1, 2, 3, 4],
   ...:                    'id': ["[1,3]", "[2]", "[5]", "[4,6,7]"]})

In [9]: df
Out[9]:
   groups       id
0       1    [1,3]
1       2      [2]
2       3      [5]
3       4  [4,6,7]

I also have another DataFrame, as follows:

In [12]: df2 = pd.DataFrame({'id' : [1,2,3,4,5,6,7], 
       'path' : ["p1,p2,p3,p4","p1,p2,p1","p1,p5,p5,p7","p1,p2,p3,p3","p1,p2","p1","p2,p3,p4"]}) 

I need to get the path values for each group. For example:

groups  path
1       p1,p2,p3,p4
        p1,p5,p5,p7
2       p1,p2,p1
3       p1,p2
4       p1,p2,p3,p3
        p1
        p2,p3,p4

Answers

0

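I'm not sure this is the best way to do it, but it worked for me. Note that it assumes the id column of df is created without the quotation marks, i.e. as lists rather than strings: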

import itertools
import pandas as pd

df = pd.DataFrame({'groups': [1, 2, 3, 4],
                   'id': [[1, 3], [2], [5], [4, 6, 7]]})
df2 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                    'path': ["p1,p2,p3,p4", "p1,p2,p1", "p1,p5,p5,p7",
                             "p1,p2,p3,p3", "p1,p2", "p1", "p2,p3,p4"]})

# For each row of df, look up the path of every id listed in that row
# and flatten the per-id results into a single list.
paths = []
for ids in df['id']:
    matches = [df2.loc[df2['id'] == int(y), 'path'].tolist() for y in ids]
    paths.append(list(itertools.chain.from_iterable(matches)))
df['paths'] = pd.Series(paths, index=df.index)
df

There may be a more concise way to do this, but the data is structured in an odd way to begin with. This gives the following output:

   groups         id                        paths
0       1     [1, 3]   [p1,p2,p3,p4, p1,p5,p5,p7]
1       2        [2]                   [p1,p2,p1]
2       3        [5]                      [p1,p2]
3       4  [4, 6, 7]  [p1,p2,p3,p3, p1, p2,p3,p4]
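
If the id column already holds strings such as "[1,3]", as in the question, one possible way to turn them into real lists first is `ast.literal_eval` from the standard library. This is only a sketch and assumes every string is a valid Python list literal:

```python
import ast
import pandas as pd

df = pd.DataFrame({'groups': [1, 2, 3, 4],
                   'id': ["[1,3]", "[2]", "[5]", "[4,6,7]"]})

# Parse each string into a Python list of ints.
# Assumption: every entry is a well-formed list literal.
df['id'] = df['id'].apply(ast.literal_eval)
```

After this step the id column contains list objects, so the loop above can be applied unchanged.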
0

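You shouldn't build your DataFrame with embedded list objects. Instead, repeat each group according to the length of its list of ids, then use pandas.merge, like so: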

# Assumes `import numpy as np` and `import pandas as pd`;
# on Python 3, also `from functools import reduce`.

In [143]: groups = list(range(1, 5))

In [144]: ids = [[1, 3], [2], [5], [4, 6, 7]]

In [145]: df = pd.DataFrame({'groups': np.repeat(groups, list(map(len, ids))),
   .....:                    'id': reduce(lambda x, y: x + y, ids)})

In [146]: df2 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
   .....:                     'path': ["p1,p2,p3,p4", "p1,p2,p1", "p1,p5,p5,p7",
   .....:                              "p1,p2,p3,p3", "p1,p2", "p1", "p2,p3,p4"]})

In [147]: df 
Out[147]: 
   groups  id
0       1   1
1       1   3
2       2   2
3       3   5
4       4   4
5       4   6
6       4   7

[7 rows x 2 columns] 

In [148]: df2 
Out[148]: 
   id         path
0   1  p1,p2,p3,p4
1   2     p1,p2,p1
2   3  p1,p5,p5,p7
3   4  p1,p2,p3,p3
4   5        p1,p2
5   6           p1
6   7     p2,p3,p4

[7 rows x 2 columns] 

In [149]: pd.merge(df, df2, on='id', how='outer') 
Out[149]: 
   groups  id         path
0       1   1  p1,p2,p3,p4
1       1   3  p1,p5,p5,p7
2       2   2     p1,p2,p1
3       3   5        p1,p2
4       4   4  p1,p2,p3,p3
5       4   6           p1
6       4   7     p2,p3,p4

[7 rows x 3 columns]
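
For what it's worth, here is a rough sketch of the same flatten-then-merge idea for the case where you already have the list-valued DataFrame. It assumes a pandas version that provides `DataFrame.explode` (0.25 or later); the names `flat`, `merged` and `paths_by_group` are only illustrative. It uses `how='left'` so the row order of the flattened frame is kept, and then groups the paths back per group, which is close to the layout the question asks for:

```python
import pandas as pd

df = pd.DataFrame({'groups': [1, 2, 3, 4],
                   'id': [[1, 3], [2], [5], [4, 6, 7]]})
df2 = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                    'path': ["p1,p2,p3,p4", "p1,p2,p1", "p1,p5,p5,p7",
                             "p1,p2,p3,p3", "p1,p2", "p1", "p2,p3,p4"]})

# One row per (group, id) pair; explode leaves an object column,
# so cast id back to an integer dtype before merging.
flat = df.explode('id').astype({'id': 'int64'}).reset_index(drop=True)

# A left merge keeps the row order of `flat`.
merged = flat.merge(df2, on='id', how='left')

# Collect the paths of each group into a list.
paths_by_group = merged.groupby('groups')['path'].agg(list)
```

If only one group is needed, selecting `paths_by_group.loc[4]` returns the list of paths for group 4.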