Pandas: fast way to split a column into multiple rows

I have the following data frame:

import pandas as pd 
df = pd.DataFrame({ 'gene':["foo", 
          "bar // lal", 
          "qux", 
          "woz"], 'cell1':[5,9,1,7], 'cell2':[12,90,13,87]}) 
df = df[["gene","cell1","cell2"]] 
df

which looks like this:

Out[6]: 
         gene  cell1  cell2
0         foo      5     12
1  bar // lal      9     90
2         qux      1     13
3         woz      7     87

What I want to do is split the "gene" column so that the result looks like this:

  gene  cell1  cell2
   foo      5     12
   bar      9     90
   lal      9     90
   qux      1     13
   woz      7     87

My current approach is this:

import pandas as pd 
import timeit 

def create(): 
    df = pd.DataFrame({ 'gene':["foo", 
          "bar // lal", 
          "qux", 
          "woz"], 'cell1':[5,9,1,7], 'cell2':[12,90,13,87]}) 
    df = df[["gene","cell1","cell2"]] 

    s = df["gene"].str.split(' // ').apply(pd.Series,1).stack() 
    s.index = s.index.droplevel(-1) 
    s.name = "Genes" 
    del df["gene"] 
    return df.join(s)  # return the joined frame instead of discarding it


if __name__ == '__main__': 
    print(timeit.timeit("create()", setup="from __main__ import create", number=100)) 
    # 0.608163118362

This is very slow. In reality I have around 40K rows to check and process.

What is a fast implementation of this?

Source

2015-11-10 neversaint

I'm guessing the slow part is the apply (rather than the split or the stack)?
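One way to check that guess is to time the three steps separately. The following is only a sketch: the helper name `time_steps` and the default size are made up for illustration, and the numbers it prints will depend on the machine and the pandas version.

```python
import time
import pandas as pd

def time_steps(n=500):
    """Time split, apply(pd.Series) and stack separately on n copies of the sample column."""
    genes = pd.Series(["foo", "bar // lal", "qux", "woz"] * n)
    timings = {}

    t0 = time.perf_counter()
    parts = genes.str.split(" // ")       # Series of lists
    timings["split"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    wide = parts.apply(pd.Series)         # builds one Series object per row
    timings["apply"] = time.perf_counter() - t0

    t0 = time.perf_counter()
    tall = wide.stack().dropna()          # dropna() removes the padding on any pandas version
    timings["stack"] = time.perf_counter() - t0

    return timings, tall

timings, tall = time_steps()
for step, seconds in timings.items():
    print(f"{step}: {seconds:.4f} s")
```

If the guess is right, the `apply` line should account for most of the total, since it constructs a separate Series for every row.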

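Yes, I agree: as soon as you start doing string operations on a data frame, I think you start to see some slowdown. At this point, though, I can't think of anything brilliant. Since you know you want the two rows (for bar // lal) to have the same values, maybe you could add another row to your data frame, just like the 'bar' row but with 'lal'. No idea whether that would be faster! –

TBH I think we need a fast built-in way to normalize elements like this.. although since I've been out of the loop for a bit, for all I know there is one by now, and I just don't know about it. :-) In the meantime, I've been using methods like this: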

import pandas as pd

def create(n): 
    df = pd.DataFrame({ 'gene':["foo", 
           "bar // lal", 
           "qux", 
           "woz"], 
         'cell1':[5,9,1,7], 'cell2':[12,90,13,87]}) 
    df = df[["gene","cell1","cell2"]] 
    df = pd.concat([df]*n) 
    df = df.reset_index(drop=True) 
    return df 

def orig(df): 
    s = df["gene"].str.split(' // ').apply(pd.Series,1).stack() 
    s.index = s.index.droplevel(-1) 
    s.name = "Genes" 
    del df["gene"] 
    return df.join(s) 

def faster(df): 
    s = df["gene"].str.split(' // ', expand=True).stack() 
    i = s.index.get_level_values(0) 
    df2 = df.loc[i].copy() 
    df2["gene"] = s.values 
    return df2

which gives me

>>> df = create(1) 
>>> df 
         gene  cell1  cell2
0         foo      5     12
1  bar // lal      9     90
2         qux      1     13
3         woz      7     87
>>> %time orig(df.copy()) 
CPU times: user 12 ms, sys: 0 ns, total: 12 ms 
Wall time: 10.2 ms 
   cell1  cell2 Genes
0      5     12   foo
1      9     90   bar
1      9     90   lal
2      1     13   qux
3      7     87   woz
>>> %time faster(df.copy()) 
CPU times: user 16 ms, sys: 0 ns, total: 16 ms 
Wall time: 12.4 ms 
  gene  cell1  cell2
0  foo      5     12
1  bar      9     90
1  lal      9     90
2  qux      1     13
3  woz      7     87

for comparable speeds at small sizes, and

>>> df = create(10000) 
>>> %timeit z = orig(df.copy()) 
1 loops, best of 3: 14.2 s per loop 
>>> %timeit z = faster(df.copy()) 
1 loops, best of 3: 231 ms per loop

for a roughly 60-fold speedup in the larger case. Note that the only reason I'm using df.copy() here is that orig is destructive.

Source
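On the "fast built-in way" wished for above: pandas 0.25 and later provide `DataFrame.explode`, which was not available when this answer was written. The sketch below is not part of the original answer; the wrapper name `split_rows` is hypothetical, and it has not been benchmarked against `faster` here.

```python
import pandas as pd

def split_rows(df, column="gene", sep=" // "):
    """Return a copy of df with `column` split on `sep`, one row per piece."""
    out = df.copy()                            # leave the caller's frame untouched
    out[column] = out[column].str.split(sep)   # each cell becomes a list of pieces
    return out.explode(column)                 # one row per list element, index repeated

df = pd.DataFrame({"gene": ["foo", "bar // lal", "qux", "woz"],
                   "cell1": [5, 9, 1, 7],
                   "cell2": [12, 90, 13, 87]})
result = split_rows(df)
print(result)
```

Like `faster`, this keeps the original index labels, so the two rows that come from "bar // lal" both carry index 1.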

2015-11-10 04:54:14 DSM

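This answer is honestly great, and it has been very useful to me. I ran into the same problem, where a cell in a row contains multiple alternative IDs (separated by semicolons). I need to search against all of the IDs, which requires splitting into multiple rows while keeping all the other column data. – griffinc

Things in bioinformatics are notorious for having multiple IDs for the same "thing" (gene, protein, etc.). – griffinc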
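The semicolon-separated ID case from the comment above could be handled along the same lines. This is a minimal sketch under stated assumptions: the column names, the sample IDs and the helper name `split_ids` are all invented for illustration, and the column is assumed to contain no missing values.

```python
import numpy as np
import pandas as pd

def split_ids(df, column, sep=";"):
    """Repeat each row once per `sep`-separated ID found in `column`."""
    pieces = df[column].str.split(sep)                   # Series of lists
    counts = pieces.str.len().to_numpy()                 # number of IDs in each row
    rows = np.repeat(np.arange(len(df)), counts)         # positional row numbers, repeated
    out = df.iloc[rows].copy()                           # duplicates all the other columns
    out[column] = [p.strip() for parts in pieces for p in parts]  # flattened IDs, same order
    return out.reset_index(drop=True)

# Made-up example data: alternative IDs in one cell, separated by semicolons.
df = pd.DataFrame({"ids": ["A1;A2;A3", "B1", "C1;C2"],
                   "score": [0.1, 0.2, 0.3]})
hits = split_ids(df, "ids")
print(hits[hits["ids"] == "C2"])   # search against a single ID
```

Selecting rows by position with `iloc` means the helper does not rely on the index being unique.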
