2016-02-25 97 views
3

pandas DataFrame: how to compare a column of lists with another column of lists row by row (without a for loop)?

import pandas as pd

df = pd.DataFrame({
    'A': [['gener'], ['gener'], ['system'], ['system'], ['gutter'], ['gutter'],
          ['gutter'], ['gutter'], ['gutter'], ['gutter'], ['aluminum'], ['aluminum'],
          ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'], ['aluminum'],
          ['aluminum'], ['aluminum'], ['aluminum', 'toledo']],
    'B': [['gutter'], ['gutter'], ['gutter', 'system'], ['gutter', 'guard', 'system'],
          ['ohio', 'gutter'], ['gutter', 'toledo'], ['toledo', 'gutter'], ['gutter'],
          ['gutter'], ['gutter'], ['how', 'to', 'instal', 'aluminum', 'gutter'],
          ['aluminum', 'gutter'], ['aluminum', 'gutter', 'color'], ['aluminum', 'gutter'],
          ['aluminum', 'gutter', 'adrian', 'ohio'],
          ['aluminum', 'gutter', 'bowl', 'green', 'ohio'],
          ['aluminum', 'gutter', 'maume', 'ohio'],
          ['aluminum', 'gutter', 'perrysburg', 'ohio'],
          ['aluminum', 'gutter', 'tecumseh', 'ohio'],
          ['aluminum', 'gutter', 'toledo', 'ohio']],
}, columns=['A', 'B'])

What it looks like

I have a DataFrame with two columns of lists.

     A          B 
0    [gener]        [gutter] 
1    [gener]        [gutter] 
2    [system]      [gutter, system] 
3    [system]    [gutter, guard, system] 
4    [gutter]       [ohio, gutter] 
5    [gutter]      [gutter, toledo] 
6    [gutter]      [toledo, gutter] 
7    [gutter]        [gutter] 
8    [gutter]        [gutter] 
9    [gutter]        [gutter] 
10   [aluminum] [how, to, instal, aluminum, gutter] 
11   [aluminum]      [aluminum, gutter] 
12   [aluminum]    [aluminum, gutter, color] 
13   [aluminum]      [aluminum, gutter] 
14   [aluminum]  [aluminum, gutter, adrian, ohio] 
15   [aluminum] [aluminum, gutter, bowl, green, ohio] 
16   [aluminum]  [aluminum, gutter, maume, ohio] 
17   [aluminum] [aluminum, gutter, perrysburg, ohio] 
18   [aluminum]  [aluminum, gutter, tecumseh, ohio] 
19 [aluminum, toledo]  [aluminum, gutter, toledo, ohio] 

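Question

If I have columns of lists, is there a pandas function that lets me check for an intersection across the whole column, row by row, and returns a new Series of either booleans or the intersecting values?

For example, I would like a pandas equivalent of this: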

def intersection(df, col1, col2, return_type='boolean'):
    df = df[[col1, col2]]
    if return_type == 'boolean':
        s = []
        for _, row in df.iterrows():
            # True if any item of col2 also appears in col1
            s.append(any(phrase in row[col1] for phrase in row[col2]))
        return pd.Series(s)
    elif return_type == 'word':
        s = []
        for _, row in df.iterrows():
            # Comma-separated words common to both lists
            s.append(', '.join(set(row[col1]).intersection(row[col2])))
        return pd.Series(s)

# Create column C in df
df['C'] = intersection(df, 'A', 'B', 'word')

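...without writing my own function or resorting to a for loop. I feel there must be a simpler way to compare the lists in two columns on the same row to see whether they intersect.

I can do it with a for loop, but it looks ugly to me.

A for loop that returns the boolean values: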

for idx in df.iterrows(): 
    any([phrase in idx[1][0] for phrase in idx[1][1]]) 

Produces:

False 
False 
True 
True 
True 
True 
True 
True 
True 
True 
True 
True 
True 
True 
True 
True 
True 
True 
True 
True 

Or, to find the intersecting words using sets:

for idx in df.iterrows(): 
    ', '.join([word for word in list(set(idx[1][0]).intersection(set(idx[1][1])))]) 

'' 
'' 
'system' 
'system' 
'gutter' 
'gutter' 
'gutter' 
'gutter' 
'gutter' 
'gutter' 
'aluminum' 
'aluminum' 
'aluminum' 
'aluminum' 
'aluminum' 
'aluminum' 
'aluminum' 
'aluminum' 
'aluminum' 
'toledo, aluminum' 

Answers

4

To check whether every item in df.A is contained in df.B:

>>> df.apply(lambda row: all(i in row.B for i in row.A), axis=1) 
0  False 
1  False 
2  True 
3  True 
4  True 
5  True 
6  True 
7  True 
8  True 
9  True 
10  True 
11  True 
12  True 
13  True 
14  True 
15  True 
16  True 
17  True 
18  True 
19  True 
dtype: bool 
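
Note that `all(...)` tests whether every item of A is in B, while the loop in the question uses `any(...)`, which only needs one shared item. The two agree on the data above but not in general. A minimal sketch of both tests with set methods, on a small made-up frame (the rows and the names `all_in` and `any_in` are illustrative assumptions, not part of the original data):

```python
import pandas as pd

# Small hypothetical frame; the last row is chosen so the two tests differ.
df = pd.DataFrame({
    'A': [['gener'], ['system'], ['aluminum', 'toledo'], ['aluminum', 'color']],
    'B': [['gutter'], ['gutter', 'system'],
          ['aluminum', 'gutter', 'toledo'], ['aluminum', 'gutter']],
})

pairs = list(zip(df['A'], df['B']))

# True when every item of A is in B (same test as the all(...) version).
all_in = pd.Series([set(a).issubset(b) for a, b in pairs], index=df.index)

# True when A and B share at least one item (the any(...) test).
any_in = pd.Series([not set(a).isdisjoint(b) for a, b in pairs], index=df.index)

print(all_in.tolist())  # [False, True, True, False]
print(any_in.tolist())  # [False, True, True, True]
```

Both `issubset` and `isdisjoint` accept any iterable, so only one side needs converting to a set.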

To get the intersection:

df['intersection'] = [list(set(a).intersection(set(b))) for a, b in zip(df.A, df.B)] 

>>> df 
        A          B  intersection 
0    [gener]        [gutter]     [] 
1    [gener]        [gutter]     [] 
2    [system]      [gutter, system]   [system] 
3    [system]    [gutter, guard, system]   [system] 
4    [gutter]       [ohio, gutter]   [gutter] 
5    [gutter]      [gutter, toledo]   [gutter] 
6    [gutter]      [toledo, gutter]   [gutter] 
7    [gutter]        [gutter]   [gutter] 
8    [gutter]        [gutter]   [gutter] 
9    [gutter]        [gutter]   [gutter] 
10   [aluminum] [how, to, instal, aluminum, gutter]   [aluminum] 
11   [aluminum]      [aluminum, gutter]   [aluminum] 
12   [aluminum]    [aluminum, gutter, color]   [aluminum] 
13   [aluminum]      [aluminum, gutter]   [aluminum] 
14   [aluminum]  [aluminum, gutter, adrian, ohio]   [aluminum] 
15   [aluminum] [aluminum, gutter, bowl, green, ohio]   [aluminum] 
16   [aluminum]  [aluminum, gutter, maume, ohio]   [aluminum] 
17   [aluminum] [aluminum, gutter, perrysburg, ohio]   [aluminum] 
18   [aluminum]  [aluminum, gutter, tecumseh, ohio]   [aluminum] 
19 [aluminum, toledo]  [aluminum, gutter, toledo, ohio] [aluminum, toledo] 
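
If you want the comma-separated string from the question's 'word' mode rather than a list, the same comprehension can join the result. A rough sketch on a shortened, made-up frame (the column name `words` is my own choice); the words are sorted here because set ordering is not guaranteed:

```python
import pandas as pd

# Shortened hypothetical frame with the same shape as the one in the question.
df = pd.DataFrame({
    'A': [['gener'], ['gutter'], ['aluminum', 'toledo']],
    'B': [['gutter'], ['ohio', 'gutter'], ['aluminum', 'gutter', 'toledo', 'ohio']],
})

# Intersect each pair of lists, sort for a stable order, then join into one string.
df['words'] = [', '.join(sorted(set(a).intersection(b)))
               for a, b in zip(df['A'], df['B'])]

print(df['words'].tolist())  # ['', 'gutter', 'aluminum, toledo']
```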

Exactly the kind of one-liners and knowledge I was hoping to get from this question. Thanks! – Jarad

1

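Just use the apply function that pandas supports; it works well for this.

Since you may have more than two columns to intersect, a helper function can be prepared like the one below and then applied with DataFrame.apply (see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html; note that axis=1 means "across the Series", i.e. the function receives one row at a time, while axis=0 means "along the Series", i.e. one column at a time, where a Series is simply one column of the DataFrame). Each row is then passed to the applied function as an iterable Series object.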

def intersect(ss):
    ss = iter(ss)
    s = set(next(ss))
    for t in ss:
        s.intersection_update(t)  # `t` need not be a `set` here; a `list` or any iterable is OK
    return s

res = df.apply(intersect, axis=1)

>>> res 
0      {} 
1      {} 
2    {system} 
3    {system} 
4    {gutter} 
5    {gutter} 
6    {gutter} 
7    {gutter} 
8    {gutter} 
9    {gutter} 
10   {aluminum} 
11   {aluminum} 
12   {aluminum} 
13   {aluminum} 
14   {aluminum} 
15   {aluminum} 
16   {aluminum} 
17   {aluminum} 
18   {aluminum} 
19 {aluminum, toledo} 
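
To illustrate the "more than two columns" point, here is a small sketch that runs the same helper over three columns. Column 'C' and its values are invented for the example, and the lambda wraps the result in `sorted` only to get a predictable list:

```python
import pandas as pd

def intersect(row):
    """Intersect every list in a row, however many columns there are."""
    values = iter(row)
    common = set(next(values))
    for other in values:
        common.intersection_update(other)  # accepts any iterable
    return common

# Hypothetical three-column frame; column 'C' is not in the original data.
df3 = pd.DataFrame({
    'A': [['gutter'], ['aluminum', 'toledo']],
    'B': [['ohio', 'gutter'], ['aluminum', 'gutter', 'toledo', 'ohio']],
    'C': [['gutter', 'guard'], ['toledo', 'ohio']],
})

# axis=1 hands each row to the function as a Series of three lists.
res3 = df3.apply(lambda row: sorted(intersect(row)), axis=1)

print(res3.tolist())  # [['gutter'], ['toledo']]
```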

You can do further operations on the result of the helper function, or modify the helper along similar lines.

Hope this helps.
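
As one possible follow-up, a Series of sets can be reduced to counts or booleans and used as a filter. This is only a sketch: `res` below is a hand-written stand-in for the helper's output, and `n_common` and `has_overlap` are names I made up:

```python
import pandas as pd

# Stand-in for the Series of sets returned by the helper (hypothetical values).
res = pd.Series([set(), {'system'}, {'aluminum', 'toledo'}])

# Number of shared words per row.
n_common = res.map(len)

# Boolean mask: True where the two lists intersect at all.
has_overlap = n_common > 0

# Keep only the rows with a non-empty intersection.
overlapping = res[has_overlap]

print(n_common.tolist())         # [0, 1, 2]
print(has_overlap.tolist())      # [False, True, True]
print(list(overlapping.index))   # [1, 2]
```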


This is the first time I've seen 'intersection_update'. Very interesting code! Thanks for a great alternative to learn from. – Jarad


@Jarad Glad you like it. Python's container types have many useful built-in operations. Exploring them in the docs can be fun :) – ShellayLee