How to apply a function that takes multiple arguments to selected columns of a pandas DataFrame

I have the following data frame:

import pandas as pd 
data = {'gene':['a','b','c','d','e'], 
     'count':[61,320,34,14,33], 
     'gene_length':[152,86,92,170,111]} 
df = pd.DataFrame(data) 
df = df[["gene","count","gene_length"]]

It looks like this:

In [9]: df
Out[9]:
  gene  count  gene_length
0    a     61          152
1    b    320           86
2    c     34           92
3    d     14          170
4    e     33          111

What I want to do is apply this function:

def calculate_RPKM(theC,theN,theL): 
    """ 
    theC == Total reads mapped to a feature (gene/linc) 
    theL == Length of feature (gene/linc) 
    theN == Total reads mapped 
    """ 
    rpkm = float((10**9) * theC)/(theN * theL) 
    return rpkm

to the count and gene_length columns, with the constant N=12345, and store the result in a new column named 'rpkm'. But why does the following fail?

N=12345 
df["rpkm"] = calculate_RPKM(df['count'],N,df['gene_length'])

What is the right way to do it? The first row should look like this:

gene  count  gene_length       rpkm
   a     61          152  32508.366

Update: this is the error I get:

-------------------------------------------------------------------------- 
TypeError         Traceback (most recent call last) 
<ipython-input-4-6270e1d19b89> in <module>() 
----> 1 df["rpkm"] = calculate_RPKM(df['count'],N,df['gene_length']) 

<ipython-input-1-48e311ca02f3> in calculate_RPKM(theC, theN, theL) 
    13  theN == Total reads mapped 
    14  """ 
---> 15  rpkm = float((10**9) * theC)/(theN * theL) 
    16  return rpkm 

/u21/coolme/.anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in wrapper(self) 
    74    return converter(self.iloc[0]) 
    75   raise TypeError(
---> 76    "cannot convert the series to {0}".format(str(converter))) 
    77  return wrapper 
    78

Source

2015-06-15 neversaint

If it fails, please post the exact error message you are getting. That makes it much easier for people to help you.

Don't cast to float in your method and it will work fine:

In [9]: 
def calculate_RPKM(theC,theN, theL): 
    """ 
    theC == Total reads mapped to a feature (gene/linc) 
    theL == Length of feature (gene/linc) 
    theN == Total reads mapped 
    """ 
    rpkm = ((10**9) * theC)/(theN * theL) 
    return rpkm 
N=12345 
df["rpkm"] = calculate_RPKM(df['count'],N,df['gene_length']) 
df 

Out[9]:
  gene  count  gene_length           rpkm
0    a     61          152   32508.366908
1    b    320           86  301411.926493
2    c     34           92   29936.429112
3    d     14          170    6670.955138
4    e     33          111   24082.405613

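The error message is telling you that you cannot cast a pandas Series to a float. You could call apply to invoke your method row by row, but you should look at rewriting the method so that it works on the entire Series. That version is vectorized and much faster than calling apply, which is essentially a for loop.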
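As a minimal sketch of the point above (the data frame is rebuilt here so the snippet runs on its own), the following reproduces the failing cast and then does the same arithmetic without it. The variable name `cast_failed` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "gene": ["a", "b", "c", "d", "e"],
    "count": [61, 320, 34, 14, 33],
    "gene_length": [152, 86, 92, 170, 111],
})
N = 12345

# float() expects a single value, so a multi-element Series cannot be converted.
try:
    float((10**9) * df["count"])
    cast_failed = False
except TypeError as exc:
    cast_failed = True
    print("TypeError:", exc)

# Arithmetic operators already work element-wise on Series, so no cast is needed.
rpkm = (10**9) * df["count"] / (N * df["gene_length"])
print(rpkm.round(3).tolist())
```

The division of two integer Series yields a float64 Series in current pandas, which is why dropping the `float(...)` call loses nothing here.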

Timings

In [11]: 

def calculate_RPKM1(theC,theN, theL): 
    """ 
    theC == Total reads mapped to a feature (gene/linc) 
    theL == Length of feature (gene/linc) 
    theN == Total reads mapped 
    """ 
    rpkm = ((10**9) * theC)/(theN * theL) 
    return rpkm 
 
def calculate_RPKM(theC,theN,theL): 
    """ 
    theC == Total reads mapped to a feature (gene/linc) 
    theL == Length of feature (gene/linc) 
    theN == Total reads mapped 
    """ 
    rpkm = float((10**9) * theC)/(theN * theL) 
    return rpkm 
N=12345 

%timeit calculate_RPKM1(df['count'],N,df['gene_length']) 
%timeit df[(['count', 'gene_length'])].apply(lambda x: calculate_RPKM(x[0], N, x[1]), axis=1) 

1000 loops, best of 3: 238 µs per loop 
100 loops, best of 3: 1.5 ms per loop

You can see that the non-casting version is over 6x faster, and the gap should widen on larger datasets.
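The numbers above come from a five-row frame on the answerer's machine. A rough way to repeat the comparison outside IPython, assuming a hypothetical 10,000-row frame of random values (`big`, `rpkm_vectorized` and `rpkm_apply` are names made up for this sketch), might look like this:

```python
import timeit

import numpy as np
import pandas as pd

N = 12345
rng = np.random.default_rng(0)
# Hypothetical larger frame: 10,000 rows of random counts and lengths.
big = pd.DataFrame({
    "count": rng.integers(1, 1000, size=10_000),
    "gene_length": rng.integers(50, 5000, size=10_000),
})

def rpkm_vectorized(frame):
    # One expression over whole columns.
    return (10**9) * frame["count"] / (N * frame["gene_length"])

def rpkm_apply(frame):
    # Row-by-row version, with the float() cast applied to scalars.
    return frame.apply(
        lambda row: float((10**9) * row["count"]) / (N * row["gene_length"]),
        axis=1,
    )

# Both approaches should agree on the values; only the speed differs.
same = np.allclose(rpkm_vectorized(big), rpkm_apply(big))

t_vec = timeit.timeit(lambda: rpkm_vectorized(big), number=5)
t_apply = timeit.timeit(lambda: rpkm_apply(big), number=5)
print(f"vectorized: {t_vec:.4f}s, apply: {t_apply:.4f}s, same values: {same}")
```

Absolute timings will vary by machine and pandas version, so treat the printed figures as indicative only.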

Update

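The following code, used together with the non-casting version of the method, is semantically equivalent to the float-casting version: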

df['rpkm'] = calculate_RPKM1(df['count'].astype(float),N,df['gene_length']) 
df 

Out[16]:
  gene  count  gene_length           rpkm
0    a     61          152   32508.366908
1    b    320           86  301411.926493
2    c     34           92   29936.429112
3    d     14          170    6670.955138
4    e     33          111   24082.405613

Source

2015-06-15 09:02:28 EdChum

If I remove 'float', then 'calculate_RPKM()' gives 0 when it is used standalone. – neversaint

Can you give me sample input that produces '0'? You could change this line in your method: 'rpkm = ((10**9) * theC)/(theN * theL).astype(float)' – EdChum
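One plausible explanation for the `0` in the comment above, assuming the Python 2.7 interpreter that the traceback path suggests: in Python 2, `/` between two plain integers is floor division. The sketch below mimics that with `//` and shows one way around it that avoids `float()`, namely writing the constant as the float literal `1e9`. The function name `calculate_rpkm_scalar_safe` and the sample values are made up for illustration.

```python
import pandas as pd

def calculate_rpkm_scalar_safe(the_c, the_n, the_l):
    # 1e9 is a float, so the division is true division for plain numbers
    # and still element-wise for pandas Series.
    return (1e9 * the_c) / (the_n * the_l)

# Hypothetical scalar inputs where the denominator exceeds the numerator.
c, n, length = 10, 50000000, 1000

floor_result = (10**9 * c) // (n * length)   # mimics Python 2 int / int
safe_result = calculate_rpkm_scalar_safe(c, n, length)
print(floor_result, safe_result)

# The same function also accepts whole columns.
counts = pd.Series([61, 320, 34])
lengths = pd.Series([152, 86, 92])
series_result = calculate_rpkm_scalar_safe(counts, 12345, lengths)
print(series_result.round(3).tolist())
```

Under Python 3 the original expression already uses true division for scalars, so this only matters on Python 2 or when `//` is used.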

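The DataFrame.apply method takes an axis argument which, when set to 1, sends each whole row into the applied function. This makes it a lot slower than a normal apply, because it is no longer a proper monoid lambda function. But it does work.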

Like this:

N=12345 
# Each row arrives as a Series, so look the values up by column label.
df["rpkm"] = df[['count', 'gene_length']].apply(lambda x: calculate_RPKM(x['count'], N, x['gene_length']), axis=1)
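To make the row-wise behaviour concrete, here is a small sketch that records what `apply` hands to the function when `axis=1`. The helper `inspect_row` and the list `seen_types` are illustrative names, and the frame is a cut-down copy of the one in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "count": [61, 320, 34],
    "gene_length": [152, 86, 92],
})

seen_types = []

def inspect_row(row):
    # Record the type of object apply passes in, then return something simple.
    seen_types.append(type(row).__name__)
    return row["count"] + row["gene_length"]

totals = df.apply(inspect_row, axis=1)
print(seen_types)
print(totals.tolist())
```

Each call receives one row as a Series indexed by the column names, which is why the values are best looked up by label and why the call happens once per row rather than once per column.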

Source

2015-06-15 08:58:23 firelynx

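This seems to be fixed simply by removing the float cast from the function definition. The operation is then applied element-wise down the two Series: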

Output of df['rpkm']:

0     32508.366908
1    301411.926493
2     29936.429112
3      6670.955138
4     24082.405613
Name: rpkm, dtype: float64

def calculate_RPKM(theC,theN,theL): 
    """ 
    theC == Total reads mapped to a feature (gene/linc) 
    theL == Length of feature (gene/linc) 
    theN == Total reads mapped 
    """ 
    rpkm = ((10 ** 9) * theC)/(theN * theL) 
    return rpkm 

df['rpkm'] = calculate_RPKM(df['count'], N, df['gene_length'])

If you want to be completely sure the output is a float, you can convert both Series to floats:

counts = df['count'].astype(float) 
lengths = df['gene_length'].astype(float) 

df['rpkm'] = calculate_RPKM(counts, N, lengths)
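A quick way to check whether the explicit conversion changes anything is to compare dtypes and values for both calls. This is a sketch that assumes a current pandas on Python 3. The names `plain` and `casted` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "count": [61, 320, 34, 14, 33],
    "gene_length": [152, 86, 92, 170, 111],
})
N = 12345

def calculate_RPKM(theC, theN, theL):
    # Non-casting version from the answer above.
    return ((10**9) * theC) / (theN * theL)

plain = calculate_RPKM(df["count"], N, df["gene_length"])
casted = calculate_RPKM(
    df["count"].astype(float), N, df["gene_length"].astype(float)
)

print(df["count"].dtype, plain.dtype, casted.dtype)
print(plain.equals(casted))
```

If both results come back as float64 with equal values, the `astype(float)` calls are a safeguard rather than a requirement for this data.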

Source

2015-06-15 09:08:24 bastewart
