Get a slice of a DataFrame with a list of column names when not all of the columns are in the DataFrame

Consider the DataFrame df:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
df

col_list = list('bcd') 

df[col_list]

raises an error:

KeyError: "['d'] not in index"
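For reference, here is a self-contained reproduction of the failure. The names `missing` and `raised` are only illustrative and not part of the question; `Index.difference` is used to list the requested labels that df lacks. The exact wording of the KeyError message may differ between pandas versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
col_list = list('bcd')

# Labels that were requested but are absent from df (here only 'd').
missing = pd.Index(col_list).difference(df.columns)
print(list(missing))

try:
    df[col_list]
    raised = False
except KeyError:
    # Selecting with a list of labels fails if any label is missing.
    raised = True
```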

How do I get as many of the columns as possible?

2016-10-25 piRSquared

What about using Index.intersection()?

In [69]: df[df.columns.intersection(col_list)] 
Out[69]: 
     b    c
0  1.0  1.0
1  1.0  1.0

In [70]: df.columns 
Out[70]: Index(['a', 'b', 'c'], dtype='object') # <---------- Index
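As a runnable sketch of the same idea outside an IPython session (the variable names `common` and `result` are assumptions for illustration): the intersection is an Index holding only the labels found on both sides, so indexing with it cannot raise the KeyError above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
col_list = list('bcd')

# Labels present in both df.columns and col_list; 'd' is dropped.
common = df.columns.intersection(col_list)

# Indexing with the intersection only asks for labels that exist.
result = df[common]
print(result)
```

If the column order matters, it is probably safest to reorder explicitly afterwards rather than rely on the order the set operation returns.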

Timing:

In [21]: df_ = pd.concat([df] * 10**5, ignore_index=True) 

In [22]: df_.shape 
Out[22]: (200000, 3) 

In [23]: df.columns 
Out[23]: Index(['a', 'b', 'c'], dtype='object') 

In [24]: col_list = list('bcd') 

In [28]: %timeit df_[df_.columns.intersection(col_list)] 
100 loops, best of 3: 6.24 ms per loop 

In [29]: %timeit df_[[col for col in col_list if col in df_.columns]] 
100 loops, best of 3: 5.69 ms per loop

Let's test it on the transposed DF (3 rows, 200K columns):

In [30]: t = df_.T 

In [31]: t.shape 
Out[31]: (3, 200000) 

In [32]: t 
Out[32]: 
     0    1    2    3    4  ...  199995  199996  199997  199998  199999
a  1.0  1.0  1.0  1.0  1.0  ...     1.0     1.0     1.0     1.0     1.0
b  1.0  1.0  1.0  1.0  1.0  ...     1.0     1.0     1.0     1.0     1.0
c  1.0  1.0  1.0  1.0  1.0  ...     1.0     1.0     1.0     1.0     1.0

[3 rows x 200000 columns] 

In [33]: col_list=[-10, -20, 10, 20, 100] 

In [34]: %timeit t[t.columns.intersection(col_list)] 
10 loops, best of 3: 52.8 ms per loop 

In [35]: %timeit t[[col for col in col_list if col in t.columns]] 
10 loops, best of 3: 103 ms per loop

Conclusion: as is almost always the case, the list comprehension wins for small lists, and Pandas/NumPy wins for bigger data sets...
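To re-run this comparison outside IPython, something along these lines should work. It is only a sketch: it uses the standard-library `timeit` module instead of `%timeit`, the function names are made up for illustration, and the frame is much narrower than the 200K-column one above so that it finishes quickly. The numbers it prints depend on the machine and the pandas version and are not comparable to the figures quoted above.

```python
import timeit

import numpy as np
import pandas as pd

# Wide frame: 3 rows, 2,000 columns labelled 0..1999.
t = pd.DataFrame(np.ones((3, 2000)), index=list('abc'))
col_list = [-10, -20, 10, 20, 100]


def via_intersection():
    return t[t.columns.intersection(col_list)]


def via_comprehension():
    return t[[col for col in col_list if col in t.columns]]


# Total seconds for 100 calls of each approach.
t_intersection = timeit.timeit(via_intersection, number=100)
t_comprehension = timeit.timeit(via_comprehension, number=100)
print(f"intersection:  {t_intersection:.4f} s")
print(f"comprehension: {t_comprehension:.4f} s")
```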

2016-10-25 17:30:42 MaxU

I forgot about the wide test... – piRSquared

How about:

df[[col for col in list('bcd') if col in df.columns]]

This yields:

     b    c
0  1.0  1.0
1  1.0  1.0
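One property of the list comprehension worth noting is that the result follows the order of the requested list, not the order of df.columns. A small sketch, wrapping the comprehension in a helper whose name `select_existing` is hypothetical:

```python
import numpy as np
import pandas as pd


def select_existing(frame, wanted):
    """Return frame restricted to the labels in `wanted` that it contains.

    Hypothetical helper: labels keep the order given in `wanted`.
    """
    present = [col for col in wanted if col in frame.columns]
    return frame[present]


df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
result = select_existing(df, ['c', 'd', 'b'])
print(list(result.columns))  # ['c', 'b']
```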

2016-10-25 17:40:51

Index objects support isin:

In [4]:  
col_list = list('bcd') 
df.ix[:,df.columns.isin(col_list)] 

Out[4]: 
   b  c
0  1  1
1  1  1

So this generates a boolean mask of the existing columns against the list passed in.
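Note that .ix was deprecated in pandas 0.20 and removed in 1.0, so the snippet above only runs on older versions. A minimal sketch of the same mask-based selection using .loc, which should behave the same way here (the names `mask` and `result` are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((2, 3)), columns=list('abc'))
col_list = list('bcd')

# Boolean array with one entry per column of df:
# True where the column label appears in col_list.
mask = df.columns.isin(col_list)

# .loc accepts the boolean mask on the column axis.
result = df.loc[:, mask]
print(result)
```

Because the mask is aligned to df.columns, the selected columns come out in the DataFrame's own order, and labels in col_list that df lacks simply never appear in the mask.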

Timings

In [5]: 
df_ = pd.concat([df] * 10**5, ignore_index=True) 
%timeit df_[df_.columns.intersection(col_list)] 
%timeit df_[[col for col in col_list if col in df_.columns]] 
%timeit df_.ix[:,df_.columns.isin(col_list)] 

100 loops, best of 3: 12.8 ms per loop 
100 loops, best of 3: 18.6 ms per loop 
10 loops, best of 3: 26.6 ms per loop

This is the slowest method, but it takes fewer characters and is perhaps easier to understand.

2016-10-25 22:48:27 EdChum

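I asked this question because it was one of those annoying things that got me when I started using pandas. I think this answer is very useful, and I suspect a lot of people will choose it. – piRSquared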
