You can use duplicated with the parameter keep=False to return a mask of all duplicates - TICKET values that occur two or more times - then filter by boolean indexing and select the column Client with this mask via loc:
print (df.TICKET.duplicated(keep=False))
0 False
1 False
2 True
3 True
4 True
5 False
6 True
7 True
8 False
9 True
10 True
11 True
12 True
Name: TICKET, dtype: bool
print (df.loc[df.TICKET.duplicated(keep=False), 'Client'])
2 14613
3 36735
4 43733
6 24456
7 27919
9 14613
10 31725
11 37547
12 43733
Name: Client, dtype: int64
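For reference, a hypothetical sample df that reproduces the outputs above (the TICKET values and the four non-duplicated Client values are assumptions, since the original data is only shown as an image):
import pandas as pd

# hypothetical data: three shared tickets (A10, A30, A50) and four single-client tickets
df = pd.DataFrame({
    'TICKET': ['A00', 'A05', 'A10', 'A10', 'A10', 'A20',
               'A30', 'A30', 'A40', 'A50', 'A50', 'A50', 'A50'],
    'Client': [10001, 10002, 14613, 36735, 43733, 10003,
               24456, 27919, 10004, 14613, 31725, 37547, 43733]})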
Then apply value_counts and, if needed, filter again by boolean indexing:
s = df.loc[df.TICKET.duplicated(keep=False), 'Client'].value_counts()
print (s)
43733 2
14613 2
36735 1
31725 1
37547 1
24456 1
27919 1
Name: Client, dtype: int64
print (s[s > 1])
43733 2
14613 2
Name: Client, dtype: int64
Finally, if needed, add reset_index to convert the Series to a DataFrame:
df1 = s[s > 1].reset_index()
df1.columns = ['Client','Count']
print (df1)
Client Count
0 43733 2
1 14613 2
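An alternative way to build the same two-column DataFrame, not from the original answer, is rename_axis plus reset_index with a name - a sketch only, since the default column names produced by value_counts differ across pandas versions:
df1 = s[s > 1].rename_axis('Client').reset_index(name='Count')
print (df1)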
A solution with filtration (groupby.filter) is slower:
s = df.groupby('TICKET').filter(lambda x: len(x) > 1)['Client'].value_counts()
print (s)
43733 2
14613 2
36735 1
31725 1
37547 1
24456 1
27919 1
Name: Client, dtype: int64
In [46]: %timeit (df.loc[df.TICKET.duplicated(keep=False), 'Client'].value_counts())
1000 loops, best of 3: 769 µs per loop
In [47]: %timeit (df.groupby('TICKET').filter(lambda x: len(x) > 1)['Client'].value_counts())
100 loops, best of 3: 2.55 ms per loop
#[1300000 rows x 2 columns]
df = pd.concat([df]*100000).reset_index(drop=True)
#print (df)
In [53]: %timeit (df.loc[df.TICKET.duplicated(keep=False), 'Client'].value_counts())
10 loops, best of 3: 54.8 ms per loop
In [54]: %timeit (df.groupby('TICKET').filter(lambda x: len(x) > 1)['Client'].value_counts())
1 loop, best of 3: 282 ms per loop
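Not in the original answer, but for comparison, an equivalent mask can also be built from group sizes with transform('size'), which stays vectorized (a sketch only; it would need to be timed on the same data):
# build the duplicate mask via group sizes instead of duplicated()
mask = df.groupby('TICKET')['TICKET'].transform('size') > 1
s = df.loc[mask, 'Client'].value_counts()
print (s[s > 1])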
Hmmm... actually, I want to count events between two clients. As in the image example (https://i.stack.imgur.com/PMWay.png), clients 14613 and 43733 appear together on two tickets, i.e., twice. – EnigmA
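Not part of the original answer, but a minimal sketch of one way to count how many tickets each pair of clients shares, using the hypothetical df from above:
from itertools import combinations
from collections import Counter

# for every ticket, collect all unordered client pairs and count them
pair_counts = Counter(
    pair
    for _, clients in df.groupby('TICKET')['Client']
    for pair in combinations(sorted(clients.unique()), 2))
print (pair_counts[(14613, 43733)])    # 2 with the hypothetical data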