2016-02-09 104 views
2

Transpose a DataFrame and sort

I have a DataFrame like this (the data represents a matrix):

         Arnston  Berg  Carlson
Arnston     0.00  1.00     2.00
Berg        1.00  0.00     3.00
Carlson     2.00  3.00     0.00

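I want to transpose it so that the row and column names are paired, with their associated value shown as a new column, sorted from smallest to largest. I only need to keep one of each row/column combination, because the two are always the same (e.g. Arnston, Berg == 1.00 and Berg, Arnston == 1.00).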

My desired output is:

Arnston, Arnston 0.00 
Berg, Berg   0.00 
Carlson, Carlson 0.00 
Arnston, Berg  1.00 
Arnston, Carlson 2.00 
Berg, Carlson  3.00 

I hope this makes sense.

Answers

4

The pandas melt function is awesome.

In:

import pandas as pd

df = df.reset_index() # Make the index into a column named 'index'
df = pd.melt(df, id_vars=['index']) # Reshape to long format: index, variable, value
# Keep one of each symmetric pair (string comparison), then sort by value
df = df[df['index'] <= df['variable']].sort_values(by='value')
df['col'] = df['index'] + ',' + df['variable'] # Concatenate the two names
df = df[['col', 'value']] # Remove unnecessary columns
df = df.set_index('col') # Use the concatenated names as the index
df

Out:

   value 
col 
Arnston,Arnston 0 
Berg,Berg  0 
Carlson,Carlson 0 
Arnston,Berg 1 
Arnston,Carlson 2 
Berg,Carlson 3 
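
For readers who want to run the melt approach end to end, here is a minimal self-contained sketch. The function name `upper_pairs_melt`, the column names `row`, `col` and `value`, and the use of a stable sort (so that ties keep their original order) are assumptions made for illustration, not part of the answer above.

```python
import pandas as pd


def upper_pairs_melt(matrix: pd.DataFrame) -> pd.Series:
    """Return one value per unordered (row, column) pair, sorted ascending."""
    # Name the index so the melted id column has a predictable name.
    long = (
        matrix.rename_axis("row")
        .reset_index()
        .melt(id_vars="row", var_name="col", value_name="value")
    )
    # Lexicographic comparison keeps exactly one of (a, b) and (b, a).
    long = long[long["row"] <= long["col"]]
    # mergesort is stable, so equal values keep their melted order.
    long = long.sort_values("value", kind="mergesort")
    labels = long["row"] + ", " + long["col"]
    return pd.Series(long["value"].to_numpy(), index=labels.to_numpy(), name="value")


names = ["Arnston", "Berg", "Carlson"]
df = pd.DataFrame(
    [[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 3.0, 0.0]],
    index=names,
    columns=names,
)
result = upper_pairs_melt(df)
print(result)
```

Note that the `<=` filter relies on the row and column labels being comparable strings; for other label types a positional mask may be a safer choice.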
0

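I'm assuming your matrix is symmetric, so you can use a nested loop to build a list of index pairs and a list of values from the upper triangle of the matrix (diagonal included). The inner loop, however, should start at the current index of the outer loop, so that each pair is visited only once.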

vals = [] 
idx = [] 
for i in range(df.shape[0]):
    for j in range(i, df.shape[1]):
        idx.append((df.index[i], df.columns[j]))
        vals.append(df.iat[i, j])
>>> pd.Series(vals, index=idx) 
(Arnston, Arnston) 0 
(Arnston, Berg)  1 
(Arnston, Carlson) 2 
(Berg, Berg)   0 
(Berg, Carlson)  3 
(Carlson, Carlson) 0 
dtype: float64 
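
Since this approach rests on the matrix being symmetric, it may be worth checking that assumption before discarding the lower triangle. A minimal sketch of such a check follows; the helper name `is_symmetric` and the tolerance value are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def is_symmetric(matrix: pd.DataFrame, tol: float = 1e-12) -> bool:
    """True if row and column labels match and values equal their transpose."""
    if matrix.shape[0] != matrix.shape[1]:
        return False
    # The labels must line up, otherwise (i, j) and (j, i) are different pairs.
    if not matrix.index.equals(matrix.columns):
        return False
    values = matrix.to_numpy(dtype=float)
    return bool(np.allclose(values, values.T, atol=tol, rtol=0.0))


names = ["Arnston", "Berg", "Carlson"]
symmetric = pd.DataFrame(
    [[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 3.0, 0.0]],
    index=names,
    columns=names,
)
broken = symmetric.copy()
broken.iat[0, 1] = 5.0  # (Arnston, Berg) no longer matches (Berg, Arnston)

print(is_symmetric(symmetric), is_symmetric(broken))
```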

To give some timing comparisons:

dfc = df.copy() 

# Nested loop. 
%%timeit 
vals = [] 
idx = [] 
for i in range(dfc.shape[0]):
    for j in range(i, dfc.shape[1]):
        idx.append((dfc.index[i], dfc.columns[j]))
        vals.append(dfc.iat[i, j])
pd.Series(vals, index=idx) 
1000 loops, best of 3: 187 µs per loop 

# Melt. 
%%timeit 
df = dfc.reset_index() 
df = pd.melt(df,id_vars=['index']) 
df = df[df['index']<=df['variable']].sort_values(by='value') 
df['col'] = df['index'] + ',' + df['variable']
df = df[['col','value']] 
df = df.set_index('col') 
100 loops, best of 3: 3.39 ms per loop 

The timings are reversed on a larger 100×100 symmetric matrix, where melt beats the competition:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 100))
for i in range(df.shape[0]):
    df.iat[i, i] = 1 # ones on the diagonal
    for j in range(i + 1, df.shape[1]):
        df.iat[i, j] = df.iat[j, i] # mirror the lower triangle onto the upper
df.columns = df.index = ['col_' + str(i) for i in range(100)]
dfc = df.copy() 

# nested loop: 
10 loops, best of 3: 55.2 ms per loop 

# melt: 
100 loops, best of 3: 5.72 ms per loop 
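
The element-by-element loop used to build the test matrix can also be written in vectorized form. The sketch below is one way to do that; the helper name `random_symmetric_frame` and the fixed seed are assumptions, and the random values will differ from those behind the timings above.

```python
import numpy as np
import pandas as pd


def random_symmetric_frame(n: int, seed: int = 0) -> pd.DataFrame:
    """Build an n x n symmetric DataFrame with ones on the diagonal."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    lower = np.tril(a, k=-1)    # strictly lower triangle, zeros elsewhere
    sym = lower + lower.T       # mirror it onto the upper triangle
    np.fill_diagonal(sym, 1.0)  # ones on the diagonal, as in the loop version
    labels = ["col_" + str(i) for i in range(n)]
    return pd.DataFrame(sym, index=labels, columns=labels)


dfc = random_symmetric_frame(100)
print(dfc.shape)
```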
0

Here's one using numpy:

%%timeit 
df = pd.DataFrame([['Arnston', 0.0, 1.0, 2.0], 
       ['Berg', 1.0, 0.0, 3.0], 
       ['Carlson', 2.0, 3.0, 0.0]], 
       columns=['Name','Arnston','Berg','Carlson']) 

df.set_index('Name', inplace=True) 

upper = np.triu_indices_from(df.values) # indices of the upper triangle (as_matrix() was removed in pandas 1.0)
vals = df.values[upper] # values at the upper-triangle indices
idx = [(df.index[i], df.columns[j]) for i,j in zip(upper[0],upper[1])] 

# w/ numpy 
1000 loops, best of 3: 810 µs per loop 

Result:

In [11]: pd.Series(vals, index=idx) 
Out[11]:  
     (Arnston, Arnston) 0 
     (Arnston, Berg)  1 
     (Arnston, Carlson) 2 
     (Berg, Berg)   0 
     (Berg, Carlson)  3 
     (Carlson, Carlson) 0 
     dtype: float64 

When you run it on Alexander's large dfc:

%%timeit 
upper = np.triu_indices_from(dfc.values) # indices of the upper triangle (as_matrix() was removed in pandas 1.0)
vals = dfc.values[upper] # values at the upper-triangle indices
idx = [(dfc.index[i], dfc.columns[j]) for i,j in zip(upper[0],upper[1])] 

100 loops, best of 3: 15.3 ms per loop 

Not quite as fast as melt.
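
In the numpy version, the list of index tuples is still built in a Python-level loop. A variant that also builds the labels with array indexing might look like the sketch below; the function name `upper_pairs_numpy` and the stable sort are assumptions, and no timing claim is made for it here.

```python
import numpy as np
import pandas as pd


def upper_pairs_numpy(matrix: pd.DataFrame) -> pd.Series:
    """Upper-triangle values (diagonal included), labelled by (row, column)."""
    # Row-major positions of the upper triangle: (0,0), (0,1), ..., (n-1,n-1).
    rows, cols = np.triu_indices(matrix.shape[0])
    # Indexing an Index with an integer array selects the labels in bulk.
    index = pd.MultiIndex.from_arrays([matrix.index[rows], matrix.columns[cols]])
    values = matrix.to_numpy()[rows, cols]
    # mergesort is stable, so equal values keep their row-major order.
    return pd.Series(values, index=index).sort_values(kind="mergesort")


names = ["Arnston", "Berg", "Carlson"]
df = pd.DataFrame(
    [[0.0, 1.0, 2.0], [1.0, 0.0, 3.0], [2.0, 3.0, 0.0]],
    index=names,
    columns=names,
)
pairs = upper_pairs_numpy(df)
print(pairs)
```

The result carries a two-level MultiIndex rather than a flat index of tuples, which makes it possible to select all pairs involving a given row label with `pairs.loc["Arnston"]`.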