2016-10-25 45 views

Answers

5

How about using Index.intersection()?

In [69]: df[df.columns.intersection(col_list)] 
Out[69]: 
     b    c
0  1.0  1.0
1  1.0  1.0

In [70]: df.columns 
Out[70]: Index(['a', 'b', 'c'], dtype='object') # <---------- Index 
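
For readers without the question's `df` at hand, here is a minimal self-contained sketch of the same idea. The frame below is an assumption reconstructed from the output above (two rows of 1.0 in columns a, b and c); the original may have been built differently.

```python
import pandas as pd

# Assumed stand-in for the question's DataFrame: two rows of 1.0 in columns a, b, c.
df = pd.DataFrame(1.0, index=range(2), columns=list("abc"))

col_list = list("bcd")  # 'd' does not exist in df

# Index.intersection keeps only the labels present in both,
# so the missing 'd' is dropped instead of raising a KeyError.
present = df.columns.intersection(col_list)
subset = df[present]

print(subset)
```

Selecting with `df[col_list]` directly would fail on the missing 'd'; intersecting first is what makes the lookup safe.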

Timing:

In [21]: df_ = pd.concat([df] * 10**5, ignore_index=True) 

In [22]: df_.shape 
Out[22]: (200000, 3) 

In [23]: df.columns 
Out[23]: Index(['a', 'b', 'c'], dtype='object') 

In [24]: col_list = list('bcd') 

In [28]: %timeit df_[df_.columns.intersection(col_list)] 
100 loops, best of 3: 6.24 ms per loop 

In [29]: %timeit df_[[col for col in col_list if col in df_.columns]] 
100 loops, best of 3: 5.69 ms per loop 

Let's test it on a transposed DF (3 rows, 200K columns):

In [30]: t = df_.T 

In [31]: t.shape 
Out[31]: (3, 200000) 

In [32]: t 
Out[32]: 
   0       1       2       3       4       ...  199995  199996  199997  199998  199999
a     1.0     1.0     1.0     1.0     1.0  ...     1.0     1.0     1.0     1.0     1.0
b     1.0     1.0     1.0     1.0     1.0  ...     1.0     1.0     1.0     1.0     1.0
c     1.0     1.0     1.0     1.0     1.0  ...     1.0     1.0     1.0     1.0     1.0

[3 rows x 200000 columns] 

In [33]: col_list=[-10, -20, 10, 20, 100] 

In [34]: %timeit t[t.columns.intersection(col_list)] 
10 loops, best of 3: 52.8 ms per loop 

In [35]: %timeit t[[col for col in col_list if col in t.columns]] 
10 loops, best of 3: 103 ms per loop 

Conclusion: the list comprehension almost always wins for small lists, while pandas/NumPy wins for bigger data sets...
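
To re-run this comparison outside IPython, something like the following sketch could be used. The frame sizes are deliberately smaller than the ones timed above so that it finishes quickly, and the helper names `via_intersection` and `via_comprehension` are hypothetical. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Assumed stand-ins for the frames timed above, kept small so this runs quickly.
tall = pd.DataFrame(1.0, index=range(2000), columns=list("abc"))
wide = tall.T  # 3 rows, 2000 integer-labelled columns


def via_intersection(frame, wanted):
    # Select the labels that are both requested and present in the frame.
    return frame[frame.columns.intersection(wanted)]


def via_comprehension(frame, wanted):
    # Filter the requested labels with a plain membership test.
    return frame[[col for col in wanted if col in frame.columns]]


cases = {
    "tall": (tall, list("bcd")),
    "wide": (wide, [-10, -20, 10, 20, 100]),
}

for name, (frame, wanted) in cases.items():
    for func in (via_intersection, via_comprehension):
        seconds = timeit.timeit(lambda: func(frame, wanted), number=100)
        print(f"{name:>4} {func.__name__:<18} {seconds / 100 * 1e3:.3f} ms per call")
```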

+1

I forgot about testing wide frames... – piRSquared

5

How about:

df[[col for col in list('bcd') if col in df.columns]] 

This yields:

     b    c
0  1.0  1.0
1  1.0  1.0
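
One property of the comprehension worth spelling out: the result follows the order of the requested list, not the order of the frame's columns. A small sketch, where `df` is an assumed stand-in for the question's frame and `existing_columns` is a hypothetical helper name:

```python
import pandas as pd

# Assumed stand-in for the question's DataFrame.
df = pd.DataFrame(1.0, index=range(2), columns=list("abc"))


def existing_columns(frame, wanted):
    """Return the labels from `wanted` that are columns of `frame`, in `wanted` order."""
    return [col for col in wanted if col in frame.columns]


# 'c' is requested before 'b', and 'd' is missing from df.
subset = df[existing_columns(df, ["c", "d", "b"])]
print(list(subset.columns))  # ['c', 'b']
```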

1

Index objects support isin:

In [4]:  
col_list = list('bcd') 
df.loc[:, df.columns.isin(col_list)]  # originally df.ix[...]; .ix was removed in pandas 1.0

Out[4]: 
   b  c
0  1  1
1  1  1

So this builds a boolean mask over the existing columns, marking which of them appear in the passed-in list.
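
To make the mask visible, here is a short sketch. The `df` below is an assumed stand-in for the question's frame, and `.loc` is used because `.ix` is not available in current pandas.

```python
import pandas as pd

# Assumed stand-in for the question's DataFrame.
df = pd.DataFrame(1.0, index=range(2), columns=list("abc"))
col_list = list("bcd")

# One boolean per existing column: True where the label appears in col_list.
mask = df.columns.isin(col_list)
print(mask)  # [False  True  True]

# .loc accepts the mask on the column axis (.ix was removed in pandas 1.0).
subset = df.loc[:, mask]
print(list(subset.columns))  # ['b', 'c']
```

Because the mask has one entry per column of `df`, the missing 'd' never enters the lookup, and the selected columns keep the frame's own order.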

Timings

In [5]: 
df_ = pd.concat([df] * 10**5, ignore_index=True) 
%timeit df_[df_.columns.intersection(col_list)] 
%timeit df_[[col for col in col_list if col in df_.columns]] 
%timeit df_.ix[:,df_.columns.isin(col_list)]  # .ix as originally timed; use .loc on pandas >= 1.0

100 loops, best of 3: 12.8 ms per loop 
100 loops, best of 3: 18.6 ms per loop 
10 loops, best of 3: 26.6 ms per loop 

This is the slowest method, but it takes fewer characters and is perhaps easier to understand.

+0

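I asked this question because it was one of those annoying things that tripped me up when I started using pandas. I think this answer is very useful, and I suspect many people will choose it. – piRSquared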
