2016-11-22 70 views
2

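Merging the Series of a grouped pandas column (which is itself a Series)

I have a pandas DataFrame in which one column is itself a Series: each cell holds a list of names. For example: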

df.head() 

Col1 Col2 
1  ["name1","name2","name3"] 
1  ["name3","name2","name4"] 
2  ["name1","name2","name3"] 
2  ["name1","name5","name6"] 

I need to concatenate Col2 within each Col1 group. I want something like this:

Col1 Col2 
1  ["name1","name2","name3","name4"] 
2  ["name1","name2","name3","name5","name6"] 

I tried using groupby with

.agg({"Col2":lambda x: pd.Series.append(x)}) 

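However, this raises an error saying that two arguments are required. I also tried using sum in the agg function. That failed too, with a "does not reduce" error.

How can I do this?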

Answers

1

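You can use groupby with apply and a custom function that first flattens the nested lists with chain (the fastest solution), then removes duplicates with set, converts the result to a list, and finally sorts it: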

import pandas as pd 
from itertools import chain 

df = pd.DataFrame({'Col1':[1,1,2,2], 
        'Col2':[["name1","name2","name3"], 
          ["name3","name2","name4"], 
          ["name1","name2","name3"], 
          ["name1","name5","name6"]]}) 

print (df) 
    Col1     Col2 
0  1 [name1, name2, name3] 
1  1 [name3, name2, name4] 
2  2 [name1, name2, name3] 
3  2 [name1, name5, name6] 
print (df.groupby('Col1')['Col2'] 
     .apply(lambda x: sorted(list(set(list(chain.from_iterable(x)))))) 
     .reset_index()) 
    Col1         Col2 
0  1   [name1, name2, name3, name4] 
1  2 [name1, name2, name3, name5, name6] 
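To make the individual steps inside the lambda easier to follow, here is a minimal sketch that applies them to the lists of a single group outside of pandas. The variable names (group, flat, unique, result) are only illustrative and do not appear in the answer above.

```python
from itertools import chain

# The Col2 values of the group Col1 == 1, taken from the sample data
group = [["name1", "name2", "name3"], ["name3", "name2", "name4"]]

# Step 1: flatten the nested lists into one flat list (duplicates remain)
flat = list(chain.from_iterable(group))

# Step 2: drop duplicates; a set has no defined order
unique = set(flat)

# Step 3: sort to get a deterministic list
result = sorted(unique)

print(flat)    # ['name1', 'name2', 'name3', 'name3', 'name2', 'name4']
print(result)  # ['name1', 'name2', 'name3', 'name4']
```

groupby(...).apply(...) runs this same sequence once per group, with x being the Series of lists belonging to that group.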

The solution can be simplified further; only chain, set and sorted are needed:

print (df.groupby('Col1')['Col2'] 
     .apply(lambda x: sorted(set(chain.from_iterable(x)))) 
     .reset_index()) 

    Col1         Col2 
0  1   [name1, name2, name3, name4] 
1  2 [name1, name2, name3, name5, name6] 
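If the names should stay in order of first appearance rather than sorted order, one possible variant (not part of the answer above, only a sketch) replaces sorted(set(...)) with dict.fromkeys, which removes duplicates while keeping insertion order on Python 3.7+. The name result is illustrative.

```python
import pandas as pd
from itertools import chain

df = pd.DataFrame({'Col1': [1, 1, 2, 2],
                   'Col2': [["name1", "name2", "name3"],
                            ["name3", "name2", "name4"],
                            ["name1", "name2", "name3"],
                            ["name1", "name5", "name6"]]})

# dict.fromkeys keeps only the first occurrence of each name and
# preserves the order in which the names were first seen
result = (df.groupby('Col1')['Col2']
            .apply(lambda x: list(dict.fromkeys(chain.from_iterable(x))))
            .reset_index())

print(result)
```

For this sample data the result matches the sorted version, because the names already first appear in alphabetical order; the two approaches differ only when that is not the case.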
1

Yes, you can't use .agg with {} on grouped data like this. Anyway, here is my attempt at the problem, with some help from numpy. (Commented for clarity.)

import numpy as np 
import pandas as pd 

# Set group by ("Col1") unique values 
groupby = df["Col1"].unique() 

# Create empty dict to store values on each iteration 
d = {} 

for val in groupby: 

    # Set "Col1" key, to the unique value (e.g., 1) 
    d.setdefault("Col1",[]).append(val) 

    # Create sub-DataFrame for each unique groupby value 
    sdf = df.loc[df["Col1"]==val] 

    # Flatten the lists stored in "Col2" into one flat list, so that 
    # np.unique also works when the lists differ in length 
    col2_unis = [j for array in sdf["Col2"] for j in array] 

    # Set "Col2" key, to be the sorted unique values of col2_unis 
    d.setdefault("Col2",[]).append(np.unique(col2_unis)) 

new_df = pd.DataFrame(d) 

print(new_df) 

A more condensed version would look like this:

d = {} 
for val in df["Col1"].unique(): 
    d.setdefault("Col1",[]).append(val) 
    sdf = df.loc[df["Col1"]==val] 
    # Flatten the group's lists before taking the unique values 
    d.setdefault("Col2",[]).append(np.unique([j for array in sdf["Col2"] for j in array])) 
new_df = pd.DataFrame(d) 
print(new_df) 

To learn more about the .setdefault() method of Python dictionaries, check this related SO question
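As a quick illustration of what .setdefault() does in the loops above, here is a minimal sketch on a plain dict. The sample pairs are made up for the example.

```python
d = {}

# setdefault returns the existing list for the key, or inserts
# and returns a new empty list if the key is not present yet
for key, value in [("Col1", 1), ("Col2", ["name1"]), ("Col1", 2)]:
    d.setdefault(key, []).append(value)

print(d)  # {'Col1': [1, 2], 'Col2': [['name1']]}
```

This avoids checking whether the key already exists before appending to its list.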