How to select values from one pandas column when every matching row in another column meets a condition

The title is confusing.

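So, suppose I have a DataFrame with a column, id, whose values occur multiple times throughout the DataFrame. Then I have another column, let's call it cumulativeOccurrences.

How can I select all unique ids such that the other column satisfies some condition, say cumulativeOccurrences > 20, for every single instance of that id?

The beginning of the code would probably be something like: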

dataframe.groupby('id')

But I can't figure out the rest.

Here is a simple small dataset that should return zero values:

id   cumulativeOccurrences 
5494178  136 
5494178  71 
5494178  18 
5494178  83 
5494178  57 
5494178  181 
5494178  13 
5494178  10 
5494178  90 
5494178  4484

OK, here is what I arrived at after some more muddling around:

res = df[['id','cumulativeOccurrences']].groupby(['id']).agg({'cumulativeOccurrences':[lambda x: all([e > 20 for e in x])]}) 
ids = res[res.cumulativeOccurrences['<lambda>']==True].index

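This gives me the list of ids that satisfy the condition. However, there is probably a better way to write the agg function than a list-comprehension lambda. Any ideas?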

Source

2017-10-28 Jeremy Schutte

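Can you add a data sample and the desired output? – jezrael

First compare the column against the threshold to get a boolean Series, then group that Series by id and call all() on the groups: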

res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all() 
ids = res.index[res] 
print (ids) 
Int64Index([5494172], dtype='int64', name='id')
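
As a minimal sketch of the same idea on data that can be run directly, the block below uses the sample from the question plus a second id (1111111, which is made up for illustration and is not in the original data). It assumes a reasonably recent pandas and uses res[res].index, which should select the same ids as res.index[res] above.

```python
import pandas as pd

# Sample data from the question (id 5494178) plus a hypothetical second id
# (1111111, invented for this sketch) whose values are all above 20.
df = pd.DataFrame({
    'id': [5494178] * 10 + [1111111] * 3,
    'cumulativeOccurrences': [136, 71, 18, 83, 57, 181, 13, 10, 90, 4484,
                              25, 300, 21],
})

# Boolean Series: True where the value exceeds the threshold.
mask = df['cumulativeOccurrences'] > 20

# Per id: True only if every row of that id passed the test.
res = mask.groupby(df['id']).all()

# Keep only the ids whose group result is True.
ids = res[res].index
print(ids.tolist())  # [1111111]

# If the rows themselves are needed rather than the ids:
rows = df[df['id'].isin(ids)]
print(rows)
```

The id from the question is dropped because three of its values (18, 13 and 10) are not above 20, which matches the "should return zero values" expectation for that sample.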

EDIT1：

Timings first for unsorted id, and then for sorted id.

import numpy as np
import pandas as pd

np.random.seed(123)
N = 10000000

df = pd.DataFrame({'id': np.random.randint(1000, size=N),
        'cumulativeOccurrences':np.random.randint(19,5000,size=N)},
        columns=['id','cumulativeOccurrences'])
print(df.head())

In [125]: %%timeit 
    ...: res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all() 
    ...: ids = res.index[res] 
    ...: 
1 loop, best of 3: 1.22 s per loop 

In [126]: %%timeit 
    ...: res = df[['id','cumulativeOccurrences']].groupby(['id']).agg({'cumulativeOccurrences':[lambda x: all([e > 20 for e in x])]}) 
    ...: ids = res[res.cumulativeOccurrences['<lambda>']==True].index 
    ...: 
1 loop, best of 3: 3.69 s per loop 

In [128]: %%timeit 
    ...: res = df['cumulativeOccurrences'].groupby(df['id']).agg(lambda x: all([e > 20 for e in x])) 
    ...: ids = res.index[res] 
    ...: 
1 loop, best of 3: 3.63 s per loop

np.random.seed(123) 
N = 10000000 

df = pd.DataFrame({'id': np.random.randint(1000, size=N), 
        'cumulativeOccurrences':np.random.randint(19,5000,size=N)}, 
        columns=['id','cumulativeOccurrences']).sort_values('id').reset_index(drop=True) 
print (df.head())

In [130]: %%timeit 
    ...: res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all() 
    ...: ids = res.index[res] 
    ...: 
1 loop, best of 3: 795 ms per loop 

In [131]: %%timeit 
    ...: res = df[['id','cumulativeOccurrences']].groupby(['id']).agg({'cumulativeOccurrences':[lambda x: all([e > 20 for e in x])]}) 
    ...: ids = res[res.cumulativeOccurrences['<lambda>']==True].index 
    ...: 
1 loop, best of 3: 3.23 s per loop 

In [132]: %%timeit 
    ...: res = df['cumulativeOccurrences'].groupby(df['id']).agg(lambda x: all([e > 20 for e in x])) 
    ...: ids = res.index[res] 
    ...: 
1 loop, best of 3: 3.15 s per loop

Conclusion: sorting by id and having a unique index can improve performance. Also, the data was tested under Python 3.
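
For readers who want to repeat the comparison outside IPython, here is a rough sketch using the standard timeit module. The function names vectorized and with_lambda are made up for this example, and N is deliberately much smaller than in the timings above so that it finishes quickly; absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Much smaller N than in the timings above (an assumption, adjust freely).
N = 100_000
rng = np.random.RandomState(123)
df = pd.DataFrame({
    'id': rng.randint(1000, size=N),
    'cumulativeOccurrences': rng.randint(19, 5000, size=N),
})

def vectorized(frame):
    # Compare first, then reduce each group with the built-in all().
    res = (frame['cumulativeOccurrences'] > 20).groupby(frame['id']).all()
    return res[res].index

def with_lambda(frame):
    # Python-level lambda evaluated once per group.
    res = frame['cumulativeOccurrences'].groupby(frame['id']).agg(
        lambda x: all(e > 20 for e in x))
    res = res.astype(bool)  # make sure the result can be used as a mask
    return res[res].index

# Both approaches should select the same ids.
assert vectorized(df).equals(with_lambda(df))

sorted_df = df.sort_values('id').reset_index(drop=True)
for label, frame in [('unsorted', df), ('sorted', sorted_df)]:
    for func in (vectorized, with_lambda):
        # Best of three single runs, similar in spirit to %%timeit.
        seconds = min(timeit.repeat(lambda: func(frame), number=1, repeat=3))
        print(f'{label:8s} {func.__name__:12s} {seconds:.4f} s')
```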

Source

2017-10-28 17:08:42 jezrael

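This filter keeps ids that have at least one cumulativeOccurrences value over 20. I am trying to filter so that all cumulativeOccurrences values for a particular id are over 20. –

Thanks for the data, I edited the answer. – jezrael

Hey, this works, I will choose it as the answer, thanks. I was wondering if there is any way to do this faster, since it is quite slow for my dataset (about 42 million). –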
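
One possible alternative, which is not part of the answer above and has not been benchmarked here: every value in a group exceeds 20 exactly when the group's minimum exceeds 20, so a single groupby min may be enough. The tiny DataFrame and the threshold variable below are made up for illustration. Note one difference in behavior: min skips NaN by default, whereas the comparison-then-all approach treats a NaN row as failing the test.

```python
import pandas as pd

# Hypothetical toy data: id 1 has a value (18) that is not above 20.
df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 3],
    'cumulativeOccurrences': [136, 71, 18, 25, 300, 21],
})

threshold = 20

# Smallest value per id; all values exceed the threshold iff the minimum does.
group_min = df.groupby('id')['cumulativeOccurrences'].min()
ids = group_min[group_min > threshold].index
print(ids.tolist())  # [2, 3]
```

Whether this is actually faster on a 42-million-row dataset would need to be measured on that data.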
