2016-09-14 50 views
4

How to get the unique values of a column that contains lists of values?

Consider the following pandas DataFrame:

import pandas as pd

df = pd.DataFrame({'name': [['one two', 'three four'], ['one'], [], [], ['one two'], ['three']],
                   'col': ['A', 'B', 'A', 'B', 'A', 'B']})
df.sort_values(by='col', inplace=True)

df
Out[62]:
  col                   name
0   A  [one two, three four]
2   A                     []
4   A              [one two]
1   B                  [one]
3   B                     []
5   B                [three]

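I would like to obtain a column that keeps track of all the unique strings included in name for each group of col.

That is, the expected output is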

df
Out[62]:
  col                   name            unique_list
0   A  [one two, three four]  [one two, three four]
2   A                     []  [one two, three four]
4   A              [one two]  [one two, three four]
1   B                  [one]           [one, three]
3   B                     []           [one, three]
5   B                [three]           [one, three]

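Indeed, taking group A as an example, you can see that the unique set of strings contained in [one two, three four], [] and [one two] is [one two, three four].

I am able to obtain the corresponding number of unique values using the approach from "Pandas : how to get the unique number of values in cells when cells contain lists?"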

df['count_unique']=df.groupby('col')['name'].transform(lambda x: list(pd.Series(x.apply(pd.Series).stack().reset_index(drop=True, level=1).nunique()))) 


df
Out[65]:
  col                   name  count_unique
0   A  [one two, three four]             2
2   A                     []             2
4   A              [one two]             2
1   B                  [one]             2
3   B                     []             2
5   B                [three]             2

However, replacing nunique with unique in the code above fails.

Any ideas? Thanks!

Answers

2

Here is a solution:

import numpy as np

df['unique_list'] = df.col.map(df.groupby('col')['name'].sum().apply(np.unique))
df


+0

Interesting. 'sum' of strings?! –

+1

@Noobie It's worse than that. It's a sum over lists. It produces one concatenated list, and I apply np.unique to that concatenated list – piRSquared
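
To make the comment above concrete, here is a minimal sketch of the "sum over lists" idea for group A only. It uses Python's built-in sum with an empty-list start value as a stand-in for what the groupby sum does; that equivalence is an assumption for illustration, and the variable names are hypothetical.

```python
import numpy as np

# Lists belonging to group 'A' in the question's frame
group_a = [['one two', 'three four'], [], ['one two']]

# Summing lists with an empty-list start value concatenates them
concatenated = sum(group_a, [])

# np.unique drops duplicates and returns the values sorted
unique_a = np.unique(concatenated).tolist()
```

Note that np.unique returns a NumPy array, so the cells produced by the answer hold arrays rather than plain lists; .tolist() converts back if a list is needed.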

+0

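Hehehe. I just tried it, but it seems your nice solution fails when there are missing values in col. In that case I get 'TypeError: can only concatenate list (not "int") to list'. Replacing the missing values with fillna('') or fillna('[]') does not work. Any ideas? –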
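
One possible way around the error, sketched under the assumption that the missing values sit in the list column name (the comment does not show the data). The frame df_missing and the helper names are hypothetical. fillna('') and fillna('[]') insert strings, not lists, so the sketch replaces every non-list entry with an empty list before concatenating.

```python
from itertools import chain

import numpy as np
import pandas as pd

# Hypothetical frame: like the question's, but with one missing value in `name`
df_missing = pd.DataFrame({
    'name': [['one two', 'three four'], ['one'], np.nan, [], ['one two'], ['three']],
    'col': ['A', 'B', 'A', 'B', 'A', 'B'],
})

# Replace anything that is not a list (e.g. NaN) with an empty list
cleaned = df_missing['name'].apply(lambda v: v if isinstance(v, list) else [])

# Flatten the lists of each group and keep the sorted unique strings
unique_per_group = cleaned.groupby(df_missing['col']).apply(
    lambda lists: sorted(set(chain.from_iterable(lists)))
)

# Map each row's group key to the unique list of its group
df_missing['unique_list'] = df_missing['col'].map(unique_per_group)
```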

2

Try:

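from functools import reduce  # needed on Python 3, where reduce is not a builtin

# Concatenate the lists of each group, then de-duplicate with set()
uniq_df = df.groupby('col')['name'].apply(lambda x: list(set(reduce(lambda y, z: y + z, x)))).reset_index()
uniq_df.columns = ['col', 'uniq_list']
pd.merge(df, uniq_df, on='col', how='left')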

Desired output:

  col                   name              uniq_list
0   A  [one two, three four]  [one two, three four]
1   A                     []  [one two, three four]
2   A              [one two]  [one two, three four]
3   B                  [one]           [three, one]
4   B                     []           [three, one]
5   B                [three]           [three, one]
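
The order inside uniq_list above ([three, one]) comes from set(), whose iteration order is not guaranteed. A variant of the same idea, sketched here as an alternative rather than as the answerer's method, flattens with itertools.chain and sorts the result so the order is stable. It maps the result back by group key instead of merging.

```python
from itertools import chain

import pandas as pd

df = pd.DataFrame({
    'name': [['one two', 'three four'], ['one'], [], [], ['one two'], ['three']],
    'col': ['A', 'B', 'A', 'B', 'A', 'B'],
})

# Flatten each group's lists in one pass, de-duplicate, and sort for a stable order
uniq = df.groupby('col')['name'].apply(
    lambda lists: sorted(set(chain.from_iterable(lists)))
)

# Broadcast the per-group result back to every row via the group key
df['uniq_list'] = df['col'].map(uniq)
```

Using map keeps the original index of df, whereas pd.merge returns a new frame with a fresh 0..n-1 index.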
+0

Thanks @abdou! Let me try it –