First filter, then use DataFrameGroupBy.all:
res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all()
ids = res.index[res]
print (ids)
Int64Index([5494172], dtype='int64', name='id')
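To see how the filter works, here is a minimal sketch on hypothetical sample data (the ids and values below are made up, not from the question):

```python
import pandas as pd

# Hypothetical sample: id 1 has all values above 20, id 2 has one value (10) below
df = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'cumulativeOccurrences': [25, 30, 25, 10],
})

# Boolean mask per row, then check the mask is True for every row of each id
res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all()
ids = res.index[res]
print(list(ids))  # only id 1 passes; id 2 fails because of the value 10
```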
EDIT1:
Timings - first for a non-sorted id, and second for a sorted one.
import numpy as np
import pandas as pd

np.random.seed(123)
N = 10000000
df = pd.DataFrame({'id': np.random.randint(1000, size=N),
'cumulativeOccurrences':np.random.randint(19,5000,size=N)},
columns=['id','cumulativeOccurrences'])
print (df.head())
In [125]: %%timeit
...: res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all()
...: ids = res.index[res]
...:
1 loop, best of 3: 1.22 s per loop
In [126]: %%timeit
...: res = df[['id','cumulativeOccurrences']].groupby(['id']).agg({'cumulativeOccurrences':[lambda x: all([e > 20 for e in x])]})
...: ids = res[res.cumulativeOccurrences['<lambda>']==True].index
...:
1 loop, best of 3: 3.69 s per loop
In [128]: %%timeit
...: res = df['cumulativeOccurrences'].groupby(df['id']).agg(lambda x: all([e > 20 for e in x]))
...: ids = res.index[res]
...:
1 loop, best of 3: 3.63 s per loop
np.random.seed(123)
N = 10000000
df = pd.DataFrame({'id': np.random.randint(1000, size=N),
'cumulativeOccurrences':np.random.randint(19,5000,size=N)},
columns=['id','cumulativeOccurrences']).sort_values('id').reset_index(drop=True)
print (df.head())
In [130]: %%timeit
...: res = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all()
...: ids = res.index[res]
...:
1 loop, best of 3: 795 ms per loop
In [131]: %%timeit
...: res = df[['id','cumulativeOccurrences']].groupby(['id']).agg({'cumulativeOccurrences':[lambda x: all([e > 20 for e in x])]})
...: ids = res[res.cumulativeOccurrences['<lambda>']==True].index
...:
1 loop, best of 3: 3.23 s per loop
In [132]: %%timeit
...: res = df['cumulativeOccurrences'].groupby(df['id']).agg(lambda x: all([e > 20 for e in x]))
...: ids = res.index[res]
...:
1 loop, best of 3: 3.15 s per loop
Conclusion - a sorted id and a unique index improve performance. The timings were also run under python 3.
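Sorting changes only the speed, not the result. A small sketch (with a much smaller N than the timings above, so it runs quickly) checking that the filter gives identical output on the unsorted and the sorted/reindexed frame:

```python
import numpy as np
import pandas as pd

np.random.seed(123)
N = 1000
df = pd.DataFrame({'id': np.random.randint(10, size=N),
                   'cumulativeOccurrences': np.random.randint(19, 50, size=N)},
                  columns=['id', 'cumulativeOccurrences'])

# Same filter on the original frame...
res1 = (df['cumulativeOccurrences'] > 20).groupby(df['id']).all()

# ...and on the frame sorted by id with a fresh unique index
df2 = df.sort_values('id').reset_index(drop=True)
res2 = (df2['cumulativeOccurrences'] > 20).groupby(df2['id']).all()

print(res1.equals(res2))
```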
Can you add some data sample and desired output? – jezrael