2016-04-04 15 views
4

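How to sort a dataframe by the number of occurrences in a column in Python (pandas)

I am trying to create a dataframe with pandas in Python from my data (scores between chemicals and proteins).

I want my dataframe to show the most frequently occurring proteins first, so I sorted the data beforehand. But when I build the dataframe, I do not get the expected result.

Here is a sample of my data: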

chemicals prots scores 
CID000000006 10116.ENSRNOP00000003921 196 
CID000000051 10116.ENSRNOP00000003921 246 
CID000000085 10116.ENSRNOP00000003921 196 
CID000000119 10116.ENSRNOP00000003921 247 
CID000000134 10116.ENSRNOP00000008952 159 
CID000000135 10116.ENSRNOP00000008952 157 
CID000000174 10116.ENSRNOP00000008952 439 
CID000000175 10116.ENSRNOP00000001021 858 
CID000000177 10116.ENSRNOP00000004027 760 

As you can see, "10116.ENSRNOP00000003921" is the protein with the most occurrences in my data.

So I would like to obtain this:

   10116.ENSRNOP00000003921  10116.ENSRNOP00000008952 
CID000000006 196     
CID000000051 246 
CID000000085 196 
CID000000119 247 
CID000000134         159 
CID000000135         157 
CID000000174         439 

And here is my code:

import pandas as pd 

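# header=0 means the first row of the file holds the column names (header=True is not accepted by read_csv)
df_rat = pd.read_csv("dt_matrix_rat.csv", sep="\t", header=0)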
df_rat.columns = ['chemicals','proteins','scores'] 
df_rat1 = df_rat.pivot(index='chemicals', columns='proteins', values='scores') 

df_rat1.to_csv("rat_matrix.csv", sep='\t', index=True ) 
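A likely reason the pre-sorted order is lost: pivot builds its columns from the sorted column labels, not from the row order of the input. A minimal sketch of this, assuming the sample rows above are typed in by hand instead of read from the file (the variable names are only illustrative):

```python
import pandas as pd

# Sample rows from the question, already ordered by protein occurrence.
df = pd.DataFrame({
    "chemicals": ["CID000000006", "CID000000051", "CID000000085",
                  "CID000000119", "CID000000134", "CID000000135",
                  "CID000000174", "CID000000175", "CID000000177"],
    "proteins": ["10116.ENSRNOP00000003921"] * 4
                + ["10116.ENSRNOP00000008952"] * 3
                + ["10116.ENSRNOP00000001021", "10116.ENSRNOP00000004027"],
    "scores": [196, 246, 196, 247, 159, 157, 439, 858, 760],
})

wide = df.pivot(index="chemicals", columns="proteins", values="scores")

# The row order of df is not carried over: the column labels come back sorted.
pivot_columns = list(wide.columns)
print(pivot_columns)
```

So sorting the long table before pivoting has no effect on the column order; the columns have to be reordered after the pivot.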

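You can get the number of occurrences with 'collections.Counter.most_common()', but it seems you already have those... the rest looks like a pivot table: https://en.wikipedia.org/wiki/Pivot_table – Swier

The data is already sorted by protein occurrence; it is just that the matrix I get does not show the results in the right order –

@ELWalou, do you mean the wrong order of columns or of rows? – MaxU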
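For reference, the Counter suggestion from the comments could look roughly like this. It is only a sketch: the protein list is copied by hand from the sample data rather than read from the file.

```python
from collections import Counter

# Protein column of the sample data, one entry per row.
prots = (["10116.ENSRNOP00000003921"] * 4
         + ["10116.ENSRNOP00000008952"] * 3
         + ["10116.ENSRNOP00000001021", "10116.ENSRNOP00000004027"])

# most_common() returns (value, count) pairs, highest count first.
counts = Counter(prots).most_common()
for prot, n in counts:
    print(prot, n)
```

This gives the occurrence counts only; reordering the pivoted columns is still a separate step.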

Answers

0

I think you need notnull, sum and sort_values, and then take the index as cols. Then select the columns with that subset:

df1 = df.pivot(index='chemicals', columns='proteins', values='scores') 

cols = df1.notnull().sum(axis=0).sort_values(ascending=False).index 
print(cols)
Index([u'10116.ENSRNOP00000003921', u'10116.ENSRNOP00000008952', 
     u'10116.ENSRNOP00000004027', u'10116.ENSRNOP00000001021'], 
     dtype='object', name=u'proteins') 

print(df1[cols])
proteins  10116.ENSRNOP00000003921 10116.ENSRNOP00000008952 \ 
chemicals               
CID000000006      196.0      NaN 
CID000000051      246.0      NaN 
CID000000085      196.0      NaN 
CID000000119      247.0      NaN 
CID000000134      NaN      159.0 
CID000000135      NaN      157.0 
CID000000174      NaN      439.0 
CID000000175      NaN      NaN 
CID000000177      NaN      NaN 

proteins  10116.ENSRNOP00000004027 10116.ENSRNOP00000001021 
chemicals               
CID000000006      NaN      NaN 
CID000000051      NaN      NaN 
CID000000085      NaN      NaN 
CID000000119      NaN      NaN 
CID000000134      NaN      NaN 
CID000000135      NaN      NaN 
CID000000174      NaN      NaN 
CID000000175      NaN      858.0 
CID000000177      760.0      NaN 

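Or reindex the columns (reindex_axis in pandas before 0.21; it was deprecated there and later removed):

print(df1.reindex(columns=cols))  # pandas < 0.21: df1.reindex_axis(cols, axis=1)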
proteins  10116.ENSRNOP00000003921 10116.ENSRNOP00000008952 \ 
chemicals               
CID000000006      196.0      NaN 
CID000000051      246.0      NaN 
CID000000085      196.0      NaN 
CID000000119      247.0      NaN 
CID000000134      NaN      159.0 
CID000000135      NaN      157.0 
CID000000174      NaN      439.0 
CID000000175      NaN      NaN 
CID000000177      NaN      NaN 

proteins  10116.ENSRNOP00000004027 10116.ENSRNOP00000001021 
chemicals               
CID000000006      NaN      NaN 
CID000000051      NaN      NaN 
CID000000085      NaN      NaN 
CID000000119      NaN      NaN 
CID000000134      NaN      NaN 
CID000000135      NaN      NaN 
CID000000174      NaN      NaN 
CID000000175      NaN      858.0 
CID000000177      760.0      NaN 
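A variant of the same idea, as a self-contained sketch: count the occurrences in the long table with value_counts and reindex the pivoted columns by that order. This avoids relying on NaN cells in the pivot to infer the counts. The inline sample data and the names order and ordered_columns are assumptions for illustration, not part of the question's file.

```python
import pandas as pd

# Sample rows from the question, typed in by hand.
df = pd.DataFrame({
    "chemicals": ["CID000000006", "CID000000051", "CID000000085",
                  "CID000000119", "CID000000134", "CID000000135",
                  "CID000000174", "CID000000175", "CID000000177"],
    "proteins": ["10116.ENSRNOP00000003921"] * 4
                + ["10116.ENSRNOP00000008952"] * 3
                + ["10116.ENSRNOP00000001021", "10116.ENSRNOP00000004027"],
    "scores": [196, 246, 196, 247, 159, 157, 439, 858, 760],
})

df1 = df.pivot(index="chemicals", columns="proteins", values="scores")

# value_counts() sorts by frequency, most frequent protein first.
order = df["proteins"].value_counts().index

# Reorder the pivoted columns to follow that frequency order.
df1 = df1.reindex(columns=order)
ordered_columns = list(df1.columns)
print(ordered_columns)
```

Proteins with the same count (here the two that occur once) may come out in either order, so add a secondary sort if that matters.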

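Isn't something missing from "sort_values()" on the second line of your code? I get: 'NoneType' object has no attribute 'index' –

I am using pandas version 0.18.0. I think the problem occurs if you use an older version. – jezrael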

0

You can use @jezrael's solution or do it this way (it is very similar):

In [136]: df 
Out[136]: 
     chemicals      prots scores 
0 CID000000006 10116.ENSRNOP00000003921  196 
1 CID000000051 10116.ENSRNOP00000003921  246 
2 CID000000085 10116.ENSRNOP00000003921  196 
3 CID000000119 10116.ENSRNOP00000003921  247 
4 CID000000134 10116.ENSRNOP00000008952  159 
5 CID000000135 10116.ENSRNOP00000008952  157 
6 CID000000174 10116.ENSRNOP00000008952  439 
7 CID000000175 10116.ENSRNOP00000001021  858 
8 CID000000177 10116.ENSRNOP00000004027  760 

Prepare the correct order:

In [169]: df.groupby('prots').sum().sort_values(by='scores', ascending=False)
Out[169]: 
          scores 
prots 
10116.ENSRNOP00000003921  885 
10116.ENSRNOP00000001021  858 
10116.ENSRNOP00000004027  760 
10116.ENSRNOP00000008952  755 

Prepare the list of sorted columns (on old pandas versions, use .sort() instead of .sort_values()):

In [170]: cols = df.groupby('prots').sum().sort_values(by='scores', ascending=False).index 

In [171]: cols 
Out[171]: 
Index(['10116.ENSRNOP00000003921', '10116.ENSRNOP00000001021', 
     '10116.ENSRNOP00000004027', '10116.ENSRNOP00000008952'], 
     dtype='object', name='prots') 

Pivot and put your columns in the right order:

In [175]: df_rat1 = df.pivot(index='chemicals', columns='prots', values='scores').fillna('') 

In [176]: df_rat1 = df_rat1[cols] 

In [177]: df_rat1 
Out[177]: 
prots  10116.ENSRNOP00000003921 10116.ENSRNOP00000001021 10116.ENSRNOP00000004027 10116.ENSRNOP00000008952 
chemicals 
CID000000006      196 
CID000000051      246 
CID000000085      196 
CID000000119      247 
CID000000134                         159 
CID000000135                         157 
CID000000174                         439 
CID000000175            858 
CID000000177                  760 
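Note that summing the scores orders the proteins by total score, which is not the same as ordering them by how often they occur. A small sketch of the difference, assuming the same sample data typed in by hand (by_sum and by_count are illustrative names):

```python
import pandas as pd

# Sample rows from the question, typed in by hand.
df = pd.DataFrame({
    "chemicals": ["CID000000006", "CID000000051", "CID000000085",
                  "CID000000119", "CID000000134", "CID000000135",
                  "CID000000174", "CID000000175", "CID000000177"],
    "prots": ["10116.ENSRNOP00000003921"] * 4
             + ["10116.ENSRNOP00000008952"] * 3
             + ["10116.ENSRNOP00000001021", "10116.ENSRNOP00000004027"],
    "scores": [196, 246, 196, 247, 159, 157, 439, 858, 760],
})

# Order by total score per protein (what the groupby/sum approach does).
by_sum = df.groupby("prots")["scores"].sum().sort_values(ascending=False)

# Order by number of rows per protein (occurrence count).
by_count = df.groupby("prots").size().sort_values(ascending=False)

print(by_sum)
print(by_count)
```

With this sample, the sum-based order puts 10116.ENSRNOP00000008952 last (total 755), while the count-based order puts it second (3 rows). Using by_count.index as cols gives the occurrence-based column order asked for in the question.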

Well, the prots are not in the right order –

@ELWalou, I have updated my answer - please check – MaxU

Line 170 always gives me the error message "AttributeError: 'DataFrame' object has no attribute 'sort_values'" –
