2015-09-01 38 views
1

Suppose I have a pandas DataFrame. How do I set the maximum element in the DataFrame to 1 and everything else to 0?

df = pd.DataFrame(index = [ix for ix in range(10)], columns=list('abcdef'), data=np.random.randn(10,6)) 

df:

 
     a   b   c   d   e   f 
0 -1.238393 -0.755117 -0.228638 -0.077966 0.412947 0.887955 
1 -0.342087 0.296171 0.177956 0.701668 -0.481744 -1.564719 
2 0.610141 0.963873 -0.943182 -0.341902 0.326416 0.818899 
3 -0.561572 0.063588 -0.195256 -1.637753 0.622627 0.845801 
4 -2.506322 -1.631023 0.506860 0.368958 1.833260 0.623055 
5 -1.313919 -1.758250 -1.082072 1.266158 0.427079 -1.018416 
6 -0.781842 1.270133 -0.510879 -1.438487 -1.101213 -0.922821 
7 -0.456999 0.234084 1.602635 0.611378 -1.147994 1.204318 
8 0.497074 0.412695 -0.458227 0.431758 0.514382 -0.479150 
9 -1.289392 -0.218624 0.122060 2.000832 -1.694544 0.773330 

How do I set the row-wise max to 1 and the other elements to 0?

I came up with:

>>> for i in range(len(df)): 
...  df.loc[i][df.loc[i].idxmax(axis=1)] = 1 
...  df.loc[i][df.loc[i] != 1] = 0 

which yields df:

 
    a b c d e f 
0 0 0 0 0 0 1 
1 0 0 0 1 0 0 
2 0 1 0 0 0 0 
3 0 0 0 0 0 1 
4 0 0 0 0 1 0 
5 0 0 0 1 0 0 
6 0 1 0 0 0 0 
7 0 0 1 0 0 0 
8 0 0 0 0 1 0 
9 0 0 0 1 0 0 

Does anyone have a better way of doing this? Perhaps by getting rid of the for loop or by applying a lambda?

Answers

0
import numpy as np


def max_binary(df):
    # called once per row: 1 where the element equals the row max, 0 elsewhere
    binary = np.where(df == df.max(), 1, 0)
    return binary


df.apply(max_binary, axis=1)
+0

This works great. Thanks. –

+0

@RaihanMasud Glad it helped. You can accept the answer to confirm that it worked for you, using the check mark to the left of this answer –

0

Following Nader's pattern, here is a shorter version:

df.apply(lambda x: np.where(x == x.max() , 1 , 0) , axis = 1) 
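
A note on the apply-based versions: in some pandas versions, a function that returns a plain NumPy array from apply(axis=1) may give back a Series of arrays rather than a DataFrame. The following is a minimal sketch, assuming a DataFrame result is wanted; it wraps each row's result in a Series so that pandas expands it back into columns. The small frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example data (not from the question).
df = pd.DataFrame({"a": [0.1, 2.0], "b": [1.5, -0.3], "c": [-0.7, 0.4]})

# Returning a Series indexed like the row makes apply build a DataFrame
# with the original column labels.
out = df.apply(
    lambda x: pd.Series(np.where(x == x.max(), 1, 0), index=x.index),
    axis=1,
)
print(out)
```

This still calls the lambda once per row, so it is not expected to be faster than the original one-liner.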
0

Use max and check for equality using eq, then cast the boolean df to int using astype; this converts True and False to 1 and 0:

In [21]: 
df = pd.DataFrame(index = [ix for ix in range(10)], columns=list('abcdef'), data=np.random.randn(10,6)) 
df 

Out[21]: 
      a   b   c   d   e   f 
0 0.797000 0.762125 -0.330518 1.117972 0.817524 0.041670 
1 0.517940 0.357369 -1.493552 -0.947396 3.082828 0.578126 
2 1.784856 0.672902 -1.359771 -0.090880 -0.093100 1.099017 
3 -0.493976 -0.390801 -0.521017 1.221517 -1.303020 1.196718 
4 0.687499 -2.371322 -2.474101 -0.397071 0.132205 0.034631 
5 0.573694 -0.206627 -0.106312 -0.661391 -0.257711 -0.875501 
6 -0.415331 1.185901 1.173457 0.317577 -0.408544 -1.055770 
7 -1.564962 -0.408390 -1.372104 -1.117561 -1.262086 -1.664516 
8 -0.987306 0.738833 -1.207124 0.738084 1.118205 -0.899086 
9 0.282800 -1.226499 1.658416 -0.381222 1.067296 -1.249829 

In [22]: 
df = df.eq(df.max(axis=1), axis=0).astype(int) 
df 

Out[22]: 
    a b c d e f 
0 0 0 0 1 0 0 
1 0 0 0 0 1 0 
2 1 0 0 0 0 0 
3 0 0 0 1 0 0 
4 1 0 0 0 0 0 
5 1 0 0 0 0 0 
6 0 1 0 0 0 0 
7 0 1 0 0 0 0 
8 0 0 0 0 1 0 
9 0 0 1 0 0 0 
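
One behaviour worth knowing about the eq approach: if a row contains its maximum more than once, every tied element is set to 1. With random floats this practically never happens, but it can with integer data. Below is a small sketch, using made-up values, that contrasts this with one possible variant based on idxmax, which marks only the first maximum in each row.

```python
import pandas as pd

# Hypothetical frame with a tie in row 0 (values chosen for illustration).
df = pd.DataFrame({"a": [3, 1], "b": [3, 5], "c": [0, 2]})

# eq-based approach: every element equal to the row max becomes 1.
all_max = df.eq(df.max(axis=1), axis=0).astype(int)

# Variant: compare the column labels against the label of the first max per row.
first_mask = df.columns.to_numpy() == df.idxmax(axis=1).to_numpy()[:, None]
first_only = pd.DataFrame(first_mask.astype(int), index=df.index, columns=df.columns)

print(all_max)
print(first_only)
```

Which of the two is right depends on whether ties should produce several 1s per row or exactly one.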

Timings:

In [24]: 
# @Raihan Masud's method 
%timeit df.apply(lambda x: np.where(x == x.max() , 1 , 0) , axis = 1) 
# mine 
%timeit df.eq(df.max(axis=1), axis=0).astype(int) 
100 loops, best of 3: 7.94 ms per loop 
1000 loops, best of 3: 640 µs per loop 

In [25]:
%%timeit
# @Nader Hisham's method
def max_binary(df):
    binary = np.where(df == df.max() , 1 , 0)
    return binary

df.apply(max_binary , axis = 1)
100 loops, best of 3: 9.63 ms per loop
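
The %timeit and %%timeit magics only exist inside IPython. To repeat the comparison from a plain script, something like the sketch below could be used with the standard timeit module. The seed, the repeat count n, and the helper names with_apply and with_eq are arbitrary choices for illustration, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, an arbitrary choice
df = pd.DataFrame(rng.standard_normal((10, 6)), columns=list("abcdef"))


def with_apply():
    # row-by-row version from the earlier answers
    return df.apply(lambda x: np.where(x == x.max(), 1, 0), axis=1)


def with_eq():
    # vectorised version from this answer
    return df.eq(df.max(axis=1), axis=0).astype(int)


n = 100  # number of calls per measurement
t_apply = timeit.timeit(with_apply, number=n) / n
t_eq = timeit.timeit(with_eq, number=n) / n
print(f"apply: {t_apply * 1e3:.3f} ms per call")
print(f"eq:    {t_eq * 1e3:.3f} ms per call")
```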

You can see that my method is over 12x faster than @Raihan's method.

In [4]:
%%timeit
for i in range(len(df)):
    df.loc[i][df.loc[i].idxmax(axis=1)] = 1
    df.loc[i][df.loc[i] != 1] = 0

10 loops, best of 3: 21.1 ms per loop

The for loop is also significantly slower.
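
For completeness, the same result can also be built directly in NumPy. This is only a sketch of an alternative, not one of the methods timed above, and the example values are made up. It finds the position of each row's maximum with argmax and writes a 1 there, so each row ends up with exactly one 1.

```python
import numpy as np
import pandas as pd

# Hypothetical example data (not from the question).
df = pd.DataFrame(
    {"a": [0.5, -1.0, 2.0], "b": [1.5, -0.2, 0.1], "c": [-0.3, -0.7, 0.4]}
)

values = df.to_numpy()
out = np.zeros(values.shape, dtype=int)
# For every row index, set the column holding that row's maximum to 1.
out[np.arange(values.shape[0]), values.argmax(axis=1)] = 1

result = pd.DataFrame(out, index=df.index, columns=df.columns)
print(result)
```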

+0

Thanks @EdChum. Did you try the approach from my original post? I'd be interested to know how long it takes compared to yours: 'for i in range(len(df)): ... df.loc[i][df.loc[i].idxmax(axis=1)] = 1 ... df.loc[i][df.loc[i] != 1] = 0' –

+0

I've edited my post; using the for loop is the slowest method – EdChum
