pandas.apply expected output shape (Shape of passed values is (x,), indices imply (x, y))
So, I have this pandas.DataFrame:
C1 C2 C3 C4 C5 Start End C8
A 1 - - - 1 4 -
A 2 - - - 6 10 -
A 3 - - - 11 14 -
A 4 - - - 15 19 -
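where - stands for some object, Start is the initial coordinate, and End is the final coordinate of each element.
I defined this function to compute the union of all intervals in the table. For this example the result should be [1, 19] minus {5} (basically a numpy array containing every covered position).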
from functools import reduce
from numpy import asarray, union1d

def coverage(table):
    #return a dataframe with the coverage of each individual peptide in a protein
    interval = (table.apply(lambda row : range(int(row['Start']),int(row['End'])+1),axis=1))
    #if there is only one peptide, return the range between its start and end positions
    if len(table) == 1:
        return asarray(range(int(table['Start']),int(table['End'])+1))
    #if there are more, unite all the intervals
    if len(table) > 1:
        return reduce(union1d,(list(interval)))
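For reference, the result the function is meant to produce for the toy table can be sketched without pandas at all. This is only an illustration of the intended output; the `intervals` list is typed in by hand from the Start and End columns of the example and is not part of the original script.

```python
from functools import reduce

import numpy as np

# (Start, End) pairs copied by hand from the example table; both ends inclusive.
intervals = [(1, 4), (6, 10), (11, 14), (15, 19)]

# Expand every interval into the positions it covers.
ranges = [np.arange(start, end + 1) for start, end in intervals]

# Fold the arrays together with union1d, as the coverage function intends to do.
covered = reduce(np.union1d, ranges)
print(covered)
```

Position 5 is absent from `covered` because no interval in the table includes it.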
So I apply this function iteratively to several DataFrames (first A, then B, C, and so on). The problem is that it fails for some tables with this error:
Traceback (most recent call last):
File "At_coverage.py", line 37, in <module>
covdir[prot] = coverage(data)
File "At_coverage.py", line 21, in coverage
interval = (table.apply(lambda row : range(int(row['Start']),int(row['End'])+1),axis=1))
File "/usr/lib/python2.7/dist-packages/pandas/core/frame.py", line 3312, in apply
return self._apply_standard(f, axis, reduce=reduce)
File "/usr/lib/python2.7/dist-packages/pandas/core/frame.py", line 3417, in _apply_standard
result = self._constructor(data=results, index=index)
File "/usr/lib/python2.7/dist-packages/pandas/core/frame.py", line 201, in __init__
mgr = self._init_dict(data, index, columns, dtype=dtype)
File "/usr/lib/python2.7/dist-packages/pandas/core/frame.py", line 323, in _init_dict
dtype=dtype)
File "/usr/lib/python2.7/dist-packages/pandas/core/frame.py", line 4473, in _arrays_to_mgr
return create_block_manager_from_arrays(arrays, arr_names, axes)
File "/usr/lib/python2.7/dist-packages/pandas/core/internals.py", line 3760, in create_block_manager_from_arrays
construction_error(len(arrays), arrays[0].shape[1:], axes, e)
File "/usr/lib/python2.7/dist-packages/pandas/core/internals.py", line 3732, in construction_error
passed,implied))
ValueError: Shape of passed values is (7,), indices imply (7, 8)
It failed on the following DataFrame:
Protein Peptide \
11106 sp|Q75W54|EBM_ARATH GJDGFJK
11107 sp|Q75W54|EBM_ARATH GJDGFJK
11108 sp|Q75W54|EBM_ARATH JJDPHJVSTFFDDYKR
11109 sp|Q75W54|EBM_ARATH JJDPHJVSTFFDDYKR
11110 sp|Q75W54|EBM_ARATH JNGEPJFJR
11111 sp|Q75W54|EBM_ARATH JNGEPJFJR
11112 sp|Q75W54|EBM_ARATH JNGEPJFJR
Fraction Count \
11106 AT_indark_IEX_fraction_18a_20150422.uniprot-pr... 2
11107 AT_indark_IEX_fraction_21a_20150422.uniprot-pr... 2
11108 AT_indark_IEX_fraction_18a_20150422.uniprot-pr... 2
11109 AT_indark_IEX_fraction_19a_20150422.uniprot-pr... 1
11110 AT_indark_IEX_fraction_19a_20150422.uniprot-pr... 2
11111 AT_indark_IEX_fraction_22a_20150422.uniprot-pr... 2
11112 AT_indark_IEX_fraction_25a_20150422.uniprot-pr... 2
Sequence Start End Length
11106 MAEIGKTVLDFGWIAARSTEVDVNGVQLTTTNPPAISSESRWMEAA... 577 584 944
11107 MAEIGKTVLDFGWIAARSTEVDVNGVQLTTTNPPAISSESRWMEAA... 577 584 944
11108 MAEIGKTVLDFGWIAARSTEVDVNGVQLTTTNPPAISSESRWMEAA... 210 226 944
11109 MAEIGKTVLDFGWIAARSTEVDVNGVQLTTTNPPAISSESRWMEAA... 210 226 944
11110 MAEIGKTVLDFGWIAARSTEVDVNGVQLTTTNPPAISSESRWMEAA... 344 353 944
11111 MAEIGKTVLDFGWIAARSTEVDVNGVQLTTTNPPAISSESRWMEAA... 344 353 944
11112 MAEIGKTVLDFGWIAARSTEVDVNGVQLTTTNPPAISSESRWMEAA... 344 353 944
[7 rows x 8 columns]
To make it work, I replaced the third line with
interval = (table.apply(lambda row : range(int(row['Start']),int(row['End'])+4),axis=1)).apply(lambda row: row[:-3])
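and I noticed that it also works with any other number in place of +1 (although with some of them it crashes later in the loop, on another DataFrame).
So this workaround is redundant and silly. My hypothesis is that the number of rows in this particular DataFrame matches some odd parameter (like the number of columns or something similar), which makes pandas try to simplify something and then crash.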
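One observation that is at least consistent with this hypothesis (a quick arithmetic check, not a confirmed diagnosis of pandas internals): the list returned for the first row of the failing DataFrame has exactly as many elements as the DataFrame has columns. The numbers below are copied by hand from the printed frame.

```python
# Start/End of the first row and the column count, copied from the failing frame.
first_start, first_end = 577, 584
n_columns = 8

# Length of the list the lambda returns for the first row with the original +1.
first_len = len(range(first_start, first_end + 1))
print(first_len == n_columns)

# Length before slicing when the workaround uses +4 instead of +1.
workaround_len = len(range(first_start, first_end + 4))
print(workaround_len == n_columns)
```

With +1 the two numbers coincide, and with +4 they do not, which would fit the idea that pandas is guessing the output shape from the returned length.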
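I made a simplified version of the program that also works with multiple start and end positions:
from functools import reduce
from numpy import union1d
from pandas import DataFrame

def multicov(row):
    intervals = []
    for i in range(len(row['Start'])):
        #print data
        intervals.append((range(int(row['Start'][i]),int(row['End'][i])+1)))
    return reduce(union1d,intervals)

dir = {'Start':[[1,7],[14]],
       'End':[[5,10],[18]]}
df = DataFrame(dir,columns=['Start','End'])
print(df)
print(df.apply(multicov,axis=1))
In this case, it gives the same error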
ValueError: Shape of passed values is (2,), indices imply (2, 2)
But interestingly, if I return two elements from the function (so that it matches 2, 2), it behaves fine.
    return reduce(union1d,intervals),'foobar'
Start End
0 [1, 7] [5, 10]
1 [14] [18]
[2 rows x 2 columns]
0 ([1, 2, 3, 4, 7, 8, 9, 10], foobar)
1 ([14, 15, 16, 17, 18], foobar)
dtype: object
If I return the output as a list,
    return [reduce(union1d,intervals),'foobar']
the output takes on the earlier column names!
Start End
0 [1, 7] [5, 10]
1 [14] [18]
[2 rows x 2 columns]
Start End
0 [1, 2, 3, 4, 7, 8, 9] foobar
1 [14, 15, 16, 17] foobar
[2 rows x 2 columns]
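In the output above, a returned list with as many items as there are columns is spread across the Start and End columns. In newer pandas releases (0.23 and later, which is an assumption about the reader's environment, since the traceback comes from an older install), `DataFrame.apply` accepts a `result_type` argument, and `result_type='reduce'` asks for one object per row. A minimal sketch under that assumption, with `multicov` rewritten slightly for brevity:

```python
from functools import reduce

import numpy as np
import pandas as pd

def multicov(row):
    # Union of all inclusive [start, end] ranges stored in one row.
    ranges = [np.arange(s, e + 1) for s, e in zip(row['Start'], row['End'])]
    return reduce(np.union1d, ranges)

df = pd.DataFrame({'Start': [[1, 7], [14]], 'End': [[5, 10], [18]]},
                  columns=['Start', 'End'])

# result_type='reduce' requests a Series with one array per row instead of
# letting apply expand list-like results into columns.
coverage = df.apply(multicov, axis=1, result_type='reduce')
print(coverage)
```

Whether this avoids the error on the old version shown in the traceback is untested; there the argument does not exist.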
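So I think the error has to do with pandas trying to force some kind of compatibility between my original DataFrame and the one built from the output, but I am surprised that it works fine for most DataFrames!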
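Why isn't using tuples() Pythonic? I was told never to iterate over pandas objects because it is not well optimized. – Nico
@Nico Yes, vectorized code is much faster than loops. But here `apply` just iterates over the rows and cannot be cythonized or vectorized. On top of that, there is some overhead from creating a DataFrame that is never used, and union1d gets called many times. – ptrj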
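A minimal sketch of the kind of plain iteration discussed in these comments, assuming the goal is only the union of positions: it walks the Start and End columns directly, skips `apply` altogether, and replaces the repeated `union1d` calls with a single `np.unique`. The toy `table` below is hypothetical and carries only the Start and End values copied from the failing DataFrame; the function assumes at least one row.

```python
import numpy as np
import pandas as pd

def coverage(table):
    """Return the sorted union of all inclusive [Start, End] ranges in table."""
    pieces = [np.arange(int(start), int(end) + 1)
              for start, end in zip(table['Start'], table['End'])]
    # np.unique sorts and removes duplicates in one call.
    return np.unique(np.concatenate(pieces))

# Hypothetical toy table with only the Start/End values of the failing frame.
table = pd.DataFrame({'Start': [577, 577, 210, 210, 344, 344, 344],
                      'End': [584, 584, 226, 226, 353, 353, 353]})
covered = coverage(table)
print(len(covered))
```

Because no intermediate Series or DataFrame is built from the per-row lists, pandas never has to infer an output shape here.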