2015-06-15 30 views
4

How to apply a function with multiple arguments to selected columns of a pandas DataFrame

I have the following DataFrame:

import pandas as pd 
data = {'gene':['a','b','c','d','e'], 
     'count':[61,320,34,14,33], 
     'gene_length':[152,86,92,170,111]} 
df = pd.DataFrame(data) 
df = df[["gene","count","gene_length"]] 

which looks like this:

In [9]: df
Out[9]:
  gene  count  gene_length
0    a     61          152
1    b    320           86
2    c     34           92
3    d     14          170
4    e     33          111

What I want to do is apply this function:

def calculate_RPKM(theC,theN,theL): 
    """ 
    theC == Total reads mapped to a feature (gene/linc) 
    theL == Length of feature (gene/linc) 
    theN == Total reads mapped 
    """ 
    rpkm = float((10**9) * theC)/(theN * theL) 
    return rpkm 

to the count and gene_length columns, with the constant N=12345, and name the new result 'rpkm'. But why does this fail?

N=12345 
df["rpkm"] = calculate_RPKM(df['count'],N,df['gene_length']) 

What is the right way to do it? The first row should look like this:

  gene  count  gene_length       rpkm
     a     61          152  32508.366

Update: the error I get is this:

-------------------------------------------------------------------------- 
TypeError         Traceback (most recent call last) 
<ipython-input-4-6270e1d19b89> in <module>() 
----> 1 df["rpkm"] = calculate_RPKM(df['count'],N,df['gene_length']) 

<ipython-input-1-48e311ca02f3> in calculate_RPKM(theC, theN, theL) 
    13  theN == Total reads mapped 
    14  """ 
---> 15  rpkm = float((10**9) * theC)/(theN * theL) 
    16  return rpkm 

/u21/coolme/.anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in wrapper(self) 
    74    return converter(self.iloc[0]) 
    75   raise TypeError(
---> 76    "cannot convert the series to {0}".format(str(converter))) 
    77  return wrapper 
    78 
+0

If it fails, please post the exact error message you are receiving. That makes it much easier for people to help you.

Answers

1

Don't cast to float inside your method and it will work fine:

In [9]: 
def calculate_RPKM(theC,theN, theL): 
    """ 
    theC == Total reads mapped to a feature (gene/linc) 
    theL == Length of feature (gene/linc) 
    theN == Total reads mapped 
    """ 
    rpkm = ((10**9) * theC)/(theN * theL) 
    return rpkm 
N=12345 
df["rpkm"] = calculate_RPKM(df['count'],N,df['gene_length']) 
df 

Out[9]:
  gene  count  gene_length           rpkm
0    a     61          152   32508.366908
1    b    320           86  301411.926493
2    c     34           92   29936.429112
3    d     14          170    6670.955138
4    e     33          111   24082.405613

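The error message is telling you that you can't cast a pandas Series to a float. You could call apply to invoke your method row by row, but you should look at rewriting the method so that it works on the whole Series. That version is vectorised and much faster than calling apply, which is essentially a for loop.

Timings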
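As a minimal sketch of the point above, the snippet below rebuilds the question's DataFrame and contrasts the two behaviours: float() rejects a multi-element Series, while ordinary arithmetic operators work element-wise. It assumes Python 3, where / is true division; the names cast_failed and rpkm are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "gene": ["a", "b", "c", "d", "e"],
    "count": [61, 320, 34, 14, 33],
    "gene_length": [152, 86, 92, 170, 111],
})
N = 12345

# float() expects a single value, so a five-element Series is rejected.
try:
    float(10**9 * df["count"])
    cast_failed = False
except TypeError:
    cast_failed = True

# Plain arithmetic operators are applied element-wise to whole Series.
rpkm = (10**9 * df["count"]) / (N * df["gene_length"])

print(cast_failed)
print(rpkm)
```

The second expression is the same formula as in calculate_RPKM, just without the scalar-only cast.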

In [11]: 

def calculate_RPKM1(theC,theN, theL): 
    """ 
    theC == Total reads mapped to a feature (gene/linc) 
    theL == Length of feature (gene/linc) 
    theN == Total reads mapped 
    """ 
    rpkm = ((10**9) * theC)/(theN * theL) 
    return rpkm

def calculate_RPKM(theC,theN,theL): 
    """ 
    theC == Total reads mapped to a feature (gene/linc) 
    theL == Length of feature (gene/linc) 
    theN == Total reads mapped 
    """ 
    rpkm = float((10**9) * theC)/(theN * theL) 
    return rpkm 
N=12345 

%timeit calculate_RPKM1(df['count'],N,df['gene_length']) 
%timeit df[(['count', 'gene_length'])].apply(lambda x: calculate_RPKM(x[0], N, x[1]), axis=1) 

1000 loops, best of 3: 238 µs per loop 
100 loops, best of 3: 1.5 ms per loop 

You can see that the non-casting version is more than 6x faster, and the gap will be even larger on bigger datasets.

Update
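To check how the gap behaves on a larger frame, a rough sketch like the following could be used. The synthetic data, the row count n_rows, and the helper names rpkm_vectorised and rpkm_rowwise are all assumptions made for illustration, and the timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 2000  # hypothetical size, chosen only for illustration
big = pd.DataFrame({
    "count": rng.integers(1, 1000, size=n_rows),
    "gene_length": rng.integers(50, 500, size=n_rows),
})
N = 12345


def rpkm_vectorised(frame):
    # One arithmetic expression over whole columns.
    return (10**9 * frame["count"]) / (N * frame["gene_length"])


def rpkm_rowwise(frame):
    # One Python-level function call per row, with the scalar float() cast.
    return frame.apply(
        lambda row: float(10**9 * row["count"]) / (N * row["gene_length"]),
        axis=1,
    )


t_vec = timeit.timeit(lambda: rpkm_vectorised(big), number=5)
t_row = timeit.timeit(lambda: rpkm_rowwise(big), number=5)
print(f"vectorised: {t_vec:.4f}s, row-wise apply: {t_row:.4f}s")
```

Both helpers return the same values, so the comparison measures only the calling style.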

The code below, which uses the non-casting version of the method, should be semantically equivalent to your original float version:

df['rpkm'] = calculate_RPKM1(df['count'].astype(float),N,df['gene_length']) 
df 

Out[16]:
  gene  count  gene_length           rpkm
0    a     61          152   32508.366908
1    b    320           86  301411.926493
2    c     34           92   29936.429112
3    d     14          170    6670.955138
4    e     33          111   24082.405613
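Why the cast on the input matters is easiest to see with integer division. The traceback path mentions python2.7, where / between integers truncates; the sketch below assumes that is the relevant behaviour and mimics it in Python 3 with //. The sample values and the large N are hypothetical, chosen so that one result falls below 1.

```python
import pandas as pd

counts = pd.Series([1, 61])
lengths = pd.Series([1000, 152])
N = 10**7  # hypothetical total, large enough to push one result below 1

# Floor division stands in for Python 2's integer "/" on integer inputs.
truncated = (10**9 * counts) // (N * lengths)

# Casting one operand to float first keeps the fractional part.
exact = (10**9 * counts.astype(float)) / (N * lengths)

print(truncated.tolist())
print(exact.tolist())
```

With integer inputs the first value is truncated to zero, while the float version keeps it.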
+0

If I remove 'float', then 'calculate_RPKM()' returns 0 when it is called standalone. – neversaint

+1

Can you give me sample input that produces '0'? You could change this line in your method: 'rpkm = ((10 ** 9) * theC)/(theN * theL).astype(float)' – EdChum

1

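The DataFrame.apply method takes an axis argument which, when set to 1, passes each whole row to the applied function. This makes it a lot slower than a normal apply function, since it is no longer a proper monoid lambda function. But it does work.

Like so: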

N=12345 
df["rpkm"] = df[['count', 'gene_length']].apply(lambda row: calculate_RPKM(row['count'], N, row['gene_length']), axis=1) 
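A variant of the same row-wise call, sketched here, avoids the lambda by passing the constant through the args parameter of DataFrame.apply. The wrapper rpkm_for_row is a hypothetical helper, not part of the question; calculate_RPKM is reproduced with its original float() cast, which is fine here because each row supplies scalars.

```python
import pandas as pd


def calculate_RPKM(theC, theN, theL):
    # Original scalar version from the question, float() cast included.
    return float((10**9) * theC) / (theN * theL)


def rpkm_for_row(row, total_reads):
    # Hypothetical wrapper: pull the two fields out of one row by label.
    return calculate_RPKM(row["count"], total_reads, row["gene_length"])


df = pd.DataFrame({
    "gene": ["a", "b", "c", "d", "e"],
    "count": [61, 320, 34, 14, 33],
    "gene_length": [152, 86, 92, 170, 111],
})
N = 12345

# axis=1 hands each row to the function; args supplies the extra argument.
df["rpkm"] = df[["count", "gene_length"]].apply(rpkm_for_row, axis=1, args=(N,))
print(df)
```

Accessing the row by column label keeps the call readable and does not depend on column order.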
1

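This seems to be fixed simply by removing the float cast from the function definition; the operation is then applied element-wise down the two Series:

Output of df['rpkm']:

0     32508.366908
1    301411.926493
2     29936.429112
3      6670.955138
4     24082.405613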
Name: rpkm, dtype: float64 

def calculate_RPKM(theC,theN,theL): 
    """ 
    theC == Total reads mapped to a feature (gene/linc) 
    theL == Length of feature (gene/linc) 
    theN == Total reads mapped 
    """ 
    rpkm = ((10 ** 9) * theC)/(theN * theL) 
    return rpkm 

df['rpkm'] = calculate_RPKM(df['count'], N, df['gene_length']) 

If you want to be completely sure that the output is a float, you can convert both Series to float first:

counts = df['count'].astype(float) 
lengths = df['gene_length'].astype(float) 

df['rpkm'] = calculate_RPKM(counts, N, lengths)