2016-03-03 81 views

Fast way to split rows

I have the following DataFrame:

import pandas as pd 
df = pd.DataFrame({'Probes':["1415693_at","1415693_at"], 
        'Genes':["Canx","LOC101056688 /// Wars "], 
        'cv_filter':[ 0.134,0.290], 
        'Organ' :["LN","LV"]} )  
df = df[["Probes","Genes","cv_filter","Organ"]] 

It looks like this:

In [16]: df 
Out[16]: 
       Probes                   Genes  cv_filter  Organ
0  1415693_at                    Canx      0.134     LN
1  1415693_at  LOC101056688 /// Wars       0.290     LV

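What I want to do is split a row into several rows wherever the entry in the Genes column contains values separated by '///'.

The result I hope to get is: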

       Probes         Genes  cv_filter  Organ
0  1415693_at          Canx      0.134     LN
1  1415693_at  LOC101056688      0.290     LV
2  1415693_at          Wars      0.290     LV

I have about 150K rows to check in total. Is there a fast way to do this?

Answers


You can first split the Genes column with str.split, build a new Series from the pieces, and then join it back to the original df:

import pandas as pd 
df = pd.DataFrame({'Probes':["1415693_at","1415693_at"], 
        'Genes':["Canx","LOC101056688 /// Wars "], 
        'cv_filter':[ 0.134,0.290], 
        'Organ' :["LN","LV"]} )  
df = df[["Probes","Genes","cv_filter","Organ"]] 
print(df)
       Probes                   Genes  cv_filter  Organ
0  1415693_at                    Canx      0.134     LN
1  1415693_at  LOC101056688 /// Wars       0.290     LV

s = pd.DataFrame([ x.split('///') for x in df['Genes'].tolist() ], index=df.index).stack() 
# or use the approach suggested in the comments:
# s = df['Genes'].str.split('///', expand=True).stack()

s.index = s.index.droplevel(-1) 
s.name = 'Genes' 
print(s)
0            Canx
1    LOC101056688
1            Wars
Name: Genes, dtype: object

# Drop the original Genes column first; otherwise join raises:
# ValueError: columns overlap but no suffix specified: Index([u'Genes'], dtype='object')
df = df.drop('Genes', axis=1)

df = df.join(s).reset_index(drop=True) 
print(df[["Probes","Genes","cv_filter","Organ"]])
       Probes         Genes  cv_filter  Organ
0  1415693_at          Canx      0.134     LN
1  1415693_at  LOC101056688      0.290     LV
2  1415693_at          Wars      0.290     LV
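
On newer pandas versions the same split-then-expand idea can probably be written with DataFrame.explode. This is not part of the answer above and assumes pandas 0.25 or later. The following is only a minimal sketch under that assumption: the variable name out is illustrative, and the str.strip() step is an addition that trims the spaces left around the '///' separator.

```python
import pandas as pd

df = pd.DataFrame({
    "Probes": ["1415693_at", "1415693_at"],
    "Genes": ["Canx", "LOC101056688 /// Wars "],
    "cv_filter": [0.134, 0.290],
    "Organ": ["LN", "LV"],
})

# Turn each Genes entry into a list of pieces, then give every piece its own row.
# Assumption: DataFrame.explode is available (pandas >= 0.25).
out = (
    df.assign(Genes=df["Genes"].str.split("///"))
      .explode("Genes")
      .reset_index(drop=True)
)

# Trim the whitespace that surrounded the '///' separator (not done in the answer above).
out["Genes"] = out["Genes"].str.strip()

print(out[["Probes", "Genes", "cv_filter", "Organ"]])
```

Whether this is faster than the stack-based version on roughly 150K rows would have to be measured on the real data.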

Why not 'df['Genes'].str.split('///', expand=True).stack()' instead of 'df['Genes'].str.split('///').apply(pd.Series, 1).stack()'? It is about twice as fast. –


@AntonProtopopov - Thank you. I added it to my answer as an alternative solution (it is only slightly slower than the DataFrame constructor). – jezrael


For that solution, your 's' is a DataFrame without a MultiIndex... –
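
The comments in this thread touch on two things that are easy to check locally: what kind of object each step returns, and how the two ways of building s compare in speed. The sketch below is one way to look at both. The helper names via_constructor and via_expand and the 2000-row test Series are made up for illustration, and the .dropna() call is an addition so the result does not depend on how stack() treats missing values, which may differ between pandas versions. No timing figures are claimed here, since they vary by machine and version.

```python
import timeit
import pandas as pd

# Hypothetical test data: the two Genes values from the question, repeated.
genes = pd.Series(["Canx", "LOC101056688 /// Wars "] * 1000, name="Genes")

def via_constructor(s):
    # List comprehension + DataFrame constructor, as in the answer.
    wide = pd.DataFrame([x.split("///") for x in s.tolist()], index=s.index)
    return wide.stack().dropna()

def via_expand(s):
    # Vectorised str.split with expand=True, as suggested in the comment.
    return s.str.split("///", expand=True).stack().dropna()

# Without stack(), the split result is a DataFrame with a single-level index.
wide = genes.str.split("///", expand=True)
print(type(wide).__name__, wide.index.nlevels)

# After stack(), it becomes a Series with a two-level MultiIndex.
stacked = via_expand(genes)
print(type(stacked).__name__, stacked.index.nlevels)

# Both approaches should yield the same pieces in the same order.
same = via_constructor(genes).tolist() == via_expand(genes).tolist()
print("same values:", same)

# Rough timing comparison; results depend on machine and pandas version.
for fn in (via_constructor, via_expand):
    print(fn.__name__, timeit.timeit(lambda: fn(genes), number=20))
```

The two-level index on the stacked Series is the reason the answer calls s.index.droplevel(-1) before joining back to df.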