2016-08-09 48 views
1

Add new columns based on multiple columns

I have a DataFrame whose head looks like the following:

df.head() 
Out[660]: 
Samples variable value Type 
0 PE01I 267N12.3_Beta 0.066517 Beta 
1 PE01R R267N12.3_Beta 0.061617 Beta 
2 PE02I 267N12.3_Beta 0.071013 Beta 
3 PE02R 267N12.3_Beta 0.056623 Beta 
4 PE03I 267N12.3_Beta 0.071633 Beta 
5 PE01I 267N12.3_FPKM 0.000000 FPKM 
6 PE01R 267N12.3_FPKM 0.003430 FPKM 
7 PE02I 267N12.3_FPKM 0.272144 FPKM 
8 PE02R 267N12.3_FPKM 0.005753 FPKM 
9 PE03I 267N12.3_FPKM 0.078708 FPKM 

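I want to add new columns named Beta and FPKM, taking the names from the `Type` column and filling them with the corresponding values from the `value` column. So far I have tried the following one-liner: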

df['Beta'] = df['Type'].map(lambda x: df.value if x == "Beta" else "FPKM") 

It gives me the following output:

Samples variable value Type      Beta 
0 PE01I 267N12.3_Beta 0.066517 Beta  0 0.066517 1 0.061617 2 0.07... 
1 PE01R 267N12.3_Beta 0.061617 Beta 0 0.066517 1 0.061617 2 0.07... 
2 PE02I 267N12.3_Beta 0.071013 Beta 0 0.066517 1 0.061617 2 0.07... 
3 PE02R 267N12.3_Beta 0.056623 Beta 0 0.066517 1 0.061617 2 0.07... 
4 PE03I 267N12.3_Beta 0.071633 Beta 0 0.066517 1 0.061617 2 0.07... 

The Beta column ends up holding the same values repeated in every row. What I am aiming for is a DataFrame that looks like this:

Samples variable Beta FPKM 
PE01I 267N12.3_Beta 0.066517 0 
PE01R 267N12.3_Beta 0.061617 0.00343 
PE02I 267N12.3_Beta 0.071013 0.272144 
PE02R 267N12.3_Beta 0.056623 0.005753 
PE03I 267N12.3_Beta 0.071633 0.078708 

Any help would be really great. Thank you.

+0

Use the `where` function. – Julien
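One possible reading of the `where` suggestion, as a minimal sketch on a small made-up subset of the question's data: the original lambda returned the whole `df.value` Series for each row, whereas `Series.where` keeps a value only on rows where the condition holds and puts NaN elsewhere. The four-row frame below is an assumption for illustration only.

```python
import pandas as pd

# Small illustrative subset of the long-format frame from the question.
df = pd.DataFrame({
    "Samples": ["PE01I", "PE01R", "PE01I", "PE01R"],
    "Type": ["Beta", "Beta", "FPKM", "FPKM"],
    "value": [0.066517, 0.061617, 0.000000, 0.003430],
})

# Keep `value` on rows of the matching Type; other rows become NaN.
df["Beta"] = df["value"].where(df["Type"] == "Beta")
df["FPKM"] = df["value"].where(df["Type"] == "FPKM")
print(df)
```

This keeps one row per (sample, type) pair, so it does not by itself produce the one-row-per-sample layout the question asks for; a reshape is still needed for that.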

Answers

1

I think you need `unstack`:

df1 = df.set_index(['Samples','Type']).unstack() 
print (df1) 
               variable                    value
Type               Beta           FPKM      Beta      FPKM
Samples
PE01I     267N12.3_Beta  267N12.3_FPKM  0.066517  0.000000
PE01R    R267N12.3_Beta  267N12.3_FPKM  0.061617  0.003430
PE02I     267N12.3_Beta  267N12.3_FPKM  0.071013  0.272144
PE02R     267N12.3_Beta  267N12.3_FPKM  0.056623  0.005753
PE03I     267N12.3_Beta  267N12.3_FPKM  0.071633  0.078708

#remove Multiindex in columns 
df1.columns = ['_'.join(col) for col in df1.columns] 
df1.reset_index(inplace=True) 
print (df1) 
    Samples variable_Beta variable_FPKM value_Beta value_FPKM 
0 PE01I 267N12.3_Beta 267N12.3_FPKM 0.066517 0.000000 
1 PE01R R267N12.3_Beta 267N12.3_FPKM 0.061617 0.003430 
2 PE02I 267N12.3_Beta 267N12.3_FPKM 0.071013 0.272144 
3 PE02R 267N12.3_Beta 267N12.3_FPKM 0.056623 0.005753 
4 PE03I 267N12.3_Beta 267N12.3_FPKM 0.071633 0.078708 

#if need remove column 
print (df1.drop('variable_FPKM', axis=1)) 
    Samples variable_Beta value_Beta value_FPKM 
0 PE01I 267N12.3_Beta 0.066517 0.000000 
1 PE01R R267N12.3_Beta 0.061617 0.003430 
2 PE02I 267N12.3_Beta 0.071013 0.272144 
3 PE02R 267N12.3_Beta 0.056623 0.005753 
4 PE03I 267N12.3_Beta 0.071633 0.078708 
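If only the numeric columns are wanted, a shorter route may be `DataFrame.pivot` restricted to `value`. This is a sketch under the assumption that each (Samples, Type) pair occurs exactly once; the six-row frame is an illustrative subset, not the asker's full data.

```python
import pandas as pd

# Illustrative subset: one Beta row and one FPKM row per sample.
df = pd.DataFrame({
    "Samples": ["PE01I", "PE01R", "PE02I", "PE01I", "PE01R", "PE02I"],
    "Type": ["Beta", "Beta", "Beta", "FPKM", "FPKM", "FPKM"],
    "value": [0.066517, 0.061617, 0.071013, 0.000000, 0.003430, 0.272144],
})

# Spread the Type labels into columns, filled from `value`.
wide = df.pivot(index="Samples", columns="Type", values="value").reset_index()
wide.columns.name = None  # drop the leftover "Type" label on the columns axis
print(wide)
```

Like `unstack`, `pivot` raises a ValueError when the index/column pairs are not unique, so it has the same limitation with duplicated samples.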

Edit by comment:

If you get the error:

ValueError: Index contains duplicate entries, cannot reshape

This means you have duplicate values in the index, so aggregation is necessary.
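To see which rows trigger the error, one option is `DataFrame.duplicated` on the columns that would form the reshaped index and column labels. A minimal sketch, assuming `Samples` and `Type` are those columns; the four-row frame is made up for illustration.

```python
import pandas as pd

# Illustrative frame where (PE01I, Beta) appears twice.
df = pd.DataFrame({
    "Samples": ["PE01I", "PE01I", "PE02I", "PE01I"],
    "Type": ["Beta", "Beta", "Beta", "FPKM"],
    "value": [0.066517, 0.061617, 0.071013, 0.000000],
})

# keep=False marks every member of a duplicated (Samples, Type) pair.
dupes = df[df.duplicated(subset=["Samples", "Type"], keep=False)]
print(dupes)
```

An empty result would suggest that `unstack` or `pivot` can reshape the frame without aggregation.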

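You need `pivot_table`. If `aggfunc` is `np.sum` or `np.mean` (which work with numeric data), the string columns are omitted; the function `','.join` works only with string values, so the numeric columns are omitted. So call `pivot_table` twice with a different `aggfunc` each time and then use `concat`: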

import pandas as pd
import numpy as np

df = pd.DataFrame({'Type': {0: 'Beta', 1: 'Beta', 2: 'Beta', 3: 'Beta', 4: 'Beta', 5: 'FPKM', 6: 'FPKM', 7: 'FPKM', 8: 'FPKM', 9: 'FPKM'}, 'value': {0: 0.066516999999999993, 1: 0.061616999999999998, 2: 0.071012999999999993, 3: 0.056623, 4: 0.071633000000000002, 5: 0.0, 6: 0.0034299999999999999, 7: 0.272144, 8: 0.0057530000000000003, 9: 0.078708}, 'variable': {0: '267N12.3_Beta', 1: 'R267N12.3_Beta', 2: '267N12.3_Beta', 3: '267N12.3_Beta', 4: '267N12.3_Beta', 5: '267N12.3_FPKM', 6: '267N12.3_FPKM', 7: '267N12.3_FPKM', 8: '267N12.3_FPKM', 9: '267N12.3_FPKM'}, 'Samples': {0: 'PE01I', 1: 'PE01I', 2: 'PE02I', 3: 'PE02R', 4: 'PE03I', 5: 'PE01I', 6: 'PE01R', 7: 'PE02I', 8: 'PE02R', 9: 'PE03I'}}) 

#changed value in second row in column Samples 
print (df) 
    Samples Type  value  variable 
0 PE01I Beta 0.066517 267N12.3_Beta 
1 PE01I Beta 0.061617 R267N12.3_Beta 
2 PE02I Beta 0.071013 267N12.3_Beta 
3 PE02R Beta 0.056623 267N12.3_Beta 
4 PE03I Beta 0.071633 267N12.3_Beta 
5 PE01I FPKM 0.000000 267N12.3_FPKM 
6 PE01R FPKM 0.003430 267N12.3_FPKM 
7 PE02I FPKM 0.272144 267N12.3_FPKM 
8 PE02R FPKM 0.005753 267N12.3_FPKM 
9 PE03I FPKM 0.078708 267N12.3_FPKM 
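#values=['variable'] limits the join to the string column (recent pandas raises instead of silently dropping the numeric one)
df1 = df.pivot_table(index='Samples', columns=['Type'], values=['variable'], aggfunc=','.join)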
print (df1) 
                             variable
Type                             Beta           FPKM
Samples
PE01I    267N12.3_Beta,R267N12.3_Beta  267N12.3_FPKM
PE01R                            None  267N12.3_FPKM
PE02I                   267N12.3_Beta  267N12.3_FPKM
PE02R                   267N12.3_Beta  267N12.3_FPKM
PE03I                   267N12.3_Beta  267N12.3_FPKM

#values=['value'] limits the mean to the numeric column
df2 = df.pivot_table(index='Samples', columns=['Type'], values=['value'], aggfunc=np.mean)
print (df2) 
            value
Type         Beta      FPKM
Samples
PE01I    0.064067  0.000000
PE01R         NaN  0.003430
PE02I    0.071013  0.272144
PE02R    0.056623  0.005753
PE03I    0.071633  0.078708

df3 = pd.concat([df1, df2], axis=1) 
df3.columns = ['_'.join(col) for col in df3.columns] 
df3.reset_index(inplace=True) 
print (df3) 
    Samples     variable_Beta variable_FPKM value_Beta value_FPKM 
0 PE01I 267N12.3_Beta,R267N12.3_Beta 267N12.3_FPKM 0.064067 0.000000 
1 PE01R       None 267N12.3_FPKM   NaN 0.003430 
2 PE02I     267N12.3_Beta 267N12.3_FPKM 0.071013 0.272144 
3 PE02R     267N12.3_Beta 267N12.3_FPKM 0.056623 0.005753 
4 PE03I     267N12.3_Beta 267N12.3_FPKM 0.071633 0.078708 
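As an alternative to two `pivot_table` calls plus `concat`, the same aggregation could be sketched in one pass with `groupby(...).agg` and a per-column function, followed by `unstack`. This is not the answer's method, only a hedged variant; the six-row frame is an illustrative subset with one duplicated (PE01I, Beta) pair.

```python
import pandas as pd

# Illustrative subset; (PE01I, Beta) is duplicated on purpose.
df = pd.DataFrame({
    "Samples": ["PE01I", "PE01I", "PE02I", "PE01I", "PE01R", "PE02I"],
    "Type": ["Beta", "Beta", "Beta", "FPKM", "FPKM", "FPKM"],
    "variable": ["267N12.3_Beta", "R267N12.3_Beta", "267N12.3_Beta",
                 "267N12.3_FPKM", "267N12.3_FPKM", "267N12.3_FPKM"],
    "value": [0.066517, 0.061617, 0.071013, 0.000000, 0.003430, 0.272144],
})

# Join strings and average numbers per (Samples, Type), then spread Type.
wide = (
    df.groupby(["Samples", "Type"])
      .agg({"variable": ",".join, "value": "mean"})
      .unstack("Type")
)

# Flatten the two-level column labels, e.g. ('value', 'Beta') -> 'value_Beta'.
wide.columns = ["_".join(col) for col in wide.columns]
wide = wide.reset_index()
print(wide)
```

Passing a dict to `agg` makes the choice of aggregation per column explicit, which avoids relying on pandas dropping columns that an aggregation function cannot handle.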
+0

The Samples column has duplicate values and it throws an error: ValueError: Index contains duplicate entries, cannot reshape – user1017373

+1

I added a solution for the problem with duplicates in the index, please see the edit. – jezrael

+0

Thanks!! It works. – user1017373

1

You can use `merge` after splitting the data into two DataFrames based on the `Type` column:

In [14]: df_1 = df.loc[(df['Type'] == "Beta"), ['Samples', 'variable', 'value']].copy() 

In [15]: df_2 = df.loc[(df['Type'] == "FPKM"), ['Samples', 'value']].copy() 

In [16]: df_1['Beta'] = df_1['value'] 

In [17]: df_2['FPKM'] = df_2['value'] 

In [18]: df_1[['Samples', 'variable', 'Beta']].merge(df_2[['Samples', 'FPKM']], on="Samples") 
Out[18]: 
    Samples  variable  Beta  FPKM 
0 PE01I 267N12.3_Beta 0.066517 0.000000 
1 PE01R R267N12.3_Beta 0.061617 0.003430 
2 PE02I 267N12.3_Beta 0.071013 0.272144 
3 PE02R 267N12.3_Beta 0.056623 0.005753 
4 PE03I 267N12.3_Beta 0.071633 0.078708 
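A variant of the same idea, sketched with `rename` instead of copying `value` into a new column, and with `how="outer"` so that samples present in only one of the two halves are kept. Both choices are assumptions beyond the answer above, and the four-row frame is made up for illustration.

```python
import pandas as pd

# Illustrative frame: PE02I has only Beta, PE01R has only FPKM.
df = pd.DataFrame({
    "Samples": ["PE01I", "PE02I", "PE01I", "PE01R"],
    "variable": ["267N12.3_Beta", "267N12.3_Beta",
                 "267N12.3_FPKM", "267N12.3_FPKM"],
    "value": [0.066517, 0.071013, 0.000000, 0.003430],
    "Type": ["Beta", "Beta", "FPKM", "FPKM"],
})

# Split by Type and rename `value` to the target column name in one step.
beta = (df.loc[df["Type"] == "Beta", ["Samples", "variable", "value"]]
          .rename(columns={"value": "Beta"}))
fpkm = (df.loc[df["Type"] == "FPKM", ["Samples", "value"]]
          .rename(columns={"value": "FPKM"}))

# how="outer" keeps samples that appear in only one of the two halves.
merged = beta.merge(fpkm, on="Samples", how="outer")
print(merged)
```

With the default inner join, PE02I and PE01R would be dropped here because each is missing from one side.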
+0

Thank you. Actually, the df I am showing here comes from a larger DataFrame, which I melted into this structure. – user1017373