2015-08-31 65 views

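This follows up on a question that was previously answered using SAS: "SAS - transpose multiple variables in rows to columns". This time I want to use pandas to transpose multiple variables from rows to columns, depending on a groupby.

What is new is that the number of rows per group is not always two but varies. Here is an example: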

   acct     la   ln  seq1  seq2
0  9999  20.01  100     1    10
1  9999  19.05    1     1    10
2  9999  30.00    1     1    10
3  9999  26.77  100     2    11
4  9999  24.96    1     2    11
5  8888  38.43  218     3    20
6  8888  37.53    1     3    20

My desired output is:

   acct     la   ln  seq1  seq2    la0    la1  la2  la3  ln0  ln1  ln2
5  8888  38.43  218     3    20  38.43  37.53  NaN  NaN  218    1  NaN
0  9999  20.01  100     1    10  20.01  19.05   30  NaN  100    1    1
3  9999  26.77  100     2    11  26.77  24.96  NaN  NaN  100    1  NaN

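In SAS I could use PROC SUMMARY, which is fairly simple, but I want to get this done in Python because I can no longer use SAS.

I have already solved the problem in a way I can reuse, but I was wondering whether there is an easier option in pandas that I am not seeing. Here is my solution. It would be interesting if somebody has a faster approach!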

# write multiple row to col based on groupby 

import pandas as pd 
from pandas import DataFrame 
import numpy as np 

data = DataFrame({ 
    "acct": [9999, 9999, 9999, 9999, 9999, 8888, 8888], 
    "seq1": [1, 1, 1, 2, 2, 3, 3], 
    "seq2": [10, 10, 10, 11, 11, 20, 20], 
    "la": [20.01, 19.05, 30, 26.77, 24.96, 38.43, 37.53], 
    "ln": [100, 1, 1, 100, 1, 218, 1] 
    }) 

# group the variables by some classes 
grouped = data.groupby(["acct", "seq1", "seq2"]) 

def rows_to_col(column, size):
    # collect the group keys (head) and the group sub-frames (contain)
    head = []
    contain = []
    for key, group in grouped:
        head.append(key)
        contain.append(group.copy())  # copy so new columns can be added safely

    # turn each group's values for `column` into a plain list
    contain_transpose = []
    for i in range(len(contain)):
        contain_transpose.append(contain[i][column].tolist())

    # determine the length of the longest sublist
    length = len(max(contain_transpose, key=len))
    # pad shorter sublists with missing values up to that length
    for i in range(len(contain_transpose)):
        missing = length - len(contain_transpose[i])
        contain_transpose[i].extend([np.nan] * missing)

    # create columns for the transposed column values
    for i in range(len(contain)):
        for j in range(size):
            contain[i][column + str(j)] = np.nan

    # assign the transposed values to the columns
    for i in range(len(contain)):
        for j in range(length):
            contain[i][column + str(j)] = contain_transpose[i][j]

    # keep only the first row of each group
    concat_list = []
    for i in range(len(contain)):
        concat_list.append(contain[i][:1])

    return pd.concat(concat_list)  # concatenate the list

# fill in column name and expected size of the column 
data_la = rows_to_col("la", 4) 
data_ln = rows_to_col("ln", 3) 

# merge the two data frames together 
cols_use = data_ln.columns.difference(data_la.columns) 

data_final = pd.merge(data_la, data_ln[cols_use], left_index=True, right_index=True, how="outer") 
# drop() returns a new frame, so store the result
data_final = data_final.drop(["la", "ln"], axis=1)

Answers


Note that:

In [58]:

print(grouped.la.apply(lambda x: pd.Series(data=x.values)).unstack())
                    0      1    2
acct seq1 seq2
8888 3    20    38.43  37.53  NaN
9999 1    10    20.01  19.05   30
     2    11    26.77  24.96  NaN

and:

In [59]:

print(grouped.ln.apply(lambda x: pd.Series(data=x.values)).unstack())
                  0  1    2
acct seq1 seq2
8888 3    20    218  1  NaN
9999 1    10    100  1    1
     2    11    100  1  NaN

Therefore:

In [60]:

df2 = pd.concat((grouped.la.apply(lambda x: pd.Series(data=x.values)).unstack(),
                 grouped.ln.apply(lambda x: pd.Series(data=x.values)).unstack()),
                keys=['la', 'ln'], axis=1)
print(df2)
                   la                ln
                    0      1    2    0  1    2
acct seq1 seq2
8888 3    20    38.43  37.53  NaN  218  1  NaN
9999 1    10    20.01  19.05   30  100  1    1
     2    11    26.77  24.96  NaN  100  1  NaN

The only issue is that the column index is a MultiIndex. If we don't want that, we can flatten it into la0, la1, ... with:

# build flat names such as "la0" from each (name, position) pair
df2.columns = [name + str(pos) for name, pos in df2.columns]
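Put together, the steps above could look like the following under Python 3. This is a minimal sketch rather than a definitive implementation: the helper name `spread` and the variable names `keys` and `wide` are made up for illustration, and the final `reset_index()` is an added step that turns the group keys back into ordinary columns. It relies on the groups having different sizes, as they do in the sample data.

```python
import pandas as pd

data = pd.DataFrame({
    "acct": [9999, 9999, 9999, 9999, 9999, 8888, 8888],
    "seq1": [1, 1, 1, 2, 2, 3, 3],
    "seq2": [10, 10, 10, 11, 11, 20, 20],
    "la": [20.01, 19.05, 30, 26.77, 24.96, 38.43, 37.53],
    "ln": [100, 1, 1, 100, 1, 218, 1],
})

keys = ["acct", "seq1", "seq2"]  # illustrative name for the grouping columns
grouped = data.groupby(keys)


def spread(column):
    """Hypothetical helper: one row per group, one column per position."""
    # Re-index each group's values as 0, 1, 2, ... and pivot that
    # position level into columns; shorter groups are padded with NaN.
    return grouped[column].apply(lambda s: pd.Series(s.values)).unstack()


# Side-by-side blocks for "la" and "ln" under a two-level column index
wide = pd.concat([spread("la"), spread("ln")], keys=["la", "ln"], axis=1)

# Flatten ("la", 0) into "la0", and so on
wide.columns = [f"{name}{pos}" for name, pos in wide.columns]

# Added step (not in the answer): group keys become regular columns again
wide = wide.reset_index()
print(wide)
```

Unlike the desired output in the question, this sketch does not keep the original `la` and `ln` columns or an all-NaN `la3` column; those would have to be added separately if they are needed.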

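I don't know what you think, but I prefer the SAS PROC TRANSPOSE syntax for its better readability. In this particular case the pandas syntax is concise but not very readable.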
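If readability is the main concern, one alternative (a different technique from the `apply`-based steps in this answer) is to number the rows within each group with `cumcount()` and let `unstack()` do the reshaping. The following is only a sketch, assuming the same sample data as in the question; the column name `pos` and the variable names `keys` and `wide` are illustrative and do not appear in the original post.

```python
import pandas as pd

data = pd.DataFrame({
    "acct": [9999, 9999, 9999, 9999, 9999, 8888, 8888],
    "seq1": [1, 1, 1, 2, 2, 3, 3],
    "seq2": [10, 10, 10, 11, 11, 20, 20],
    "la": [20.01, 19.05, 30, 26.77, 24.96, 38.43, 37.53],
    "ln": [100, 1, 1, 100, 1, 218, 1],
})

keys = ["acct", "seq1", "seq2"]  # illustrative name for the grouping columns

wide = (
    # "pos" (hypothetical column name) is the row number within each group
    data.assign(pos=data.groupby(keys).cumcount())
        .set_index(keys + ["pos"])[["la", "ln"]]
        # move the position level into the columns; gaps become NaN
        .unstack("pos")
)

# Flatten ("la", 0) into "la0", and so on
wide.columns = [f"{name}{pos}" for name, pos in wide.columns]
wide = wide.reset_index()
print(wide)
```

This version handles `la` and `ln` in one pass, so adding further value columns only means extending the `[["la", "ln"]]` selection.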

Very cool, this is much shorter! Thanks.

Hopefully it is faster too. Loops in Python are generally slow. Always glad to help a fellow SAS user.