2016-12-21 99 views
2

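Counting occurrences of rows while looping over a DataFrame

I have a DataFrame in which every row is repeated 3 times. While looping over it, how can I tell whether I have already seen a row and then do something, i.e. print something at the second occurrence?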

print(df)
     user  date 
0  User001 2014-11-01 
40  User001 2014-11-01 
80  User001 2014-11-01 
120 User001 2014-11-08 
200 User001 2014-11-08 
160 User001 2014-11-08 
280 User001 2014-11-15 
240 User001 2014-11-15 
320 User001 2014-11-15 
400 User001 2014-11-22 
440 User001 2014-11-22 
360 User001 2014-11-22 
... ...... .......... 
... ...... .......... 
1300 User008 2014-11-22 
1341 User008 2014-11-22 
1360 User008 2014-11-22 

for line in df.itertuples(): 
    user = line[1] 
    date = line[2] 

    print(user, date)
    #do something after second occurrence of tuple i.e. print "second occurrence" 

('User001', '2014-11-01') 
('User001', '2014-11-01') 
second occurrence 
('User001', '2014-11-01') 
('User001', '2014-11-08') 
('User001', '2014-11-08') 
second occurrence 
('User001', '2014-11-08') 
('User001', '2014-11-15') 
('User001', '2014-11-15') 
second occurrence 
('User001', '2014-11-15') 
('User001', '2014-11-22') 
('User001', '2014-11-22') 
second occurrence 
('User001', '2014-11-22') 
('User008', '2014-11-22') 
('User008', '2014-11-22') 
second occurrence 
('User008', '2014-11-22') 

Answers

2

You can use cumcount to find the index labels of all second occurrences:

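# cumcount() numbers the rows inside each (user, date) group starting at 0,
# so a value of 1 marks the second occurrence
mask = df.groupby(['user', 'date']).cumcount() == 1
# keep only the index labels where the mask is True
idx = mask[mask].index
print (idx)
Int64Index([40, 200, 240, 440], dtype='int64')
for line in df.itertuples():
    print (line.user)
    print (line.date)
    if line.Index in idx:
        print ('second occurrence')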

User001 
2014-11-01 
User001 
2014-11-01 
second occurrence 
User001 
2014-11-01 
User001 
2014-11-08 
User001 
2014-11-08 
second occurrence 
User001 
2014-11-08 
User001 
2014-11-15 
User001 
2014-11-15 
second occurrence 
User001 
2014-11-15 
User001 
2014-11-22 
User001 
2014-11-22 
second occurrence 
User001 
2014-11-22 
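
For anyone who wants to try this without the original data, here is a minimal self-contained sketch of the same cumcount idea. The sample frame is made up to resemble the one in the question, and the names `occurrence` and `second_idx` are only illustrative.

```python
import pandas as pd

# Hypothetical sample data shaped like the frame in the question
df = pd.DataFrame(
    {
        "user": ["User001"] * 6,
        "date": ["2014-11-01"] * 3 + ["2014-11-08"] * 3,
    },
    index=[0, 40, 80, 120, 200, 160],
)

# cumcount() counts from 0 within each (user, date) group, in row order;
# adding 1 turns it into a 1-based occurrence number
occurrence = df.groupby(["user", "date"]).cumcount() + 1

# index labels of the rows that are the second occurrence in their group
second_idx = occurrence[occurrence == 2].index

for line in df.itertuples():
    print(line.user, line.date)
    if line.Index in second_idx:
        print("second occurrence")
```

Keeping the full occurrence number rather than only a boolean mask makes it easy to react to the third (or any other) occurrence later on.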

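Another way to find these index labels (it relies on every row appearing exactly three times, so that the second occurrence is the one that is neither the first nor the last):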

idx = df[df.duplicated(['user', 'date']) & 
     df.duplicated(['user', 'date'], keep='last')].index 
print (idx) 
Int64Index([40, 200, 240, 440], dtype='int64') 
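
To see why combining the two duplicated() masks picks out the middle row, it may help to look at each mask separately. This is only a sketch on made-up data; the names `not_first`, `not_last` and `middle` are illustrative.

```python
import pandas as pd

# Hypothetical sample data: two groups, each row repeated three times
df = pd.DataFrame(
    {
        "user": ["User001"] * 3 + ["User008"] * 3,
        "date": ["2014-11-01"] * 3 + ["2014-11-22"] * 3,
    }
)

keys = ["user", "date"]

# True for every repeat after the first occurrence (keep='first' is the default)
not_first = df.duplicated(keys)
# True for every repeat before the last occurrence
not_last = df.duplicated(keys, keep="last")

# With exactly three copies per group, "neither first nor last" is the second one
middle = not_first & not_last

print(pd.DataFrame({"not_first": not_first, "not_last": not_last, "middle": middle}))
```

If a group had four or more copies, `middle` would flag every row between the first and the last, so the cumcount approach is the safer choice when the number of repeats can vary.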
1

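I would suggest using the DataFrame.duplicated() method to get a boolean index that identifies the duplicated rows.

Depending on how you want to display the duplicates you can use it in different ways, but if you want to iterate over the rows and print a notice for each one that is a duplicate, something like this might work:

# duplicated() flags every repeat after the first one,
# so the notice is printed for the second AND the third occurrence
duplicate_index = df.duplicated()
for row, dupl in zip(df.itertuples(index=False), duplicate_index):
    print(row[0], row[1])
    if dupl:
        print('duplicate')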
1

Use a Counter to keep track:

from collections import Counter

seen = Counter()
for i, row in df.iterrows():
    # turn the row into a hashable tuple so it can be used as a Counter key
    tup = tuple(row.values.tolist())
    # a count of 1 means the row has been seen exactly once before
    if seen[tup] == 1:
        print(tup, 'second occurrence')
    else:
        print(tup)
    seen.update([tup])

('User001', '2014-11-01')
('User001', '2014-11-01') second occurrence
('User001', '2014-11-01')
('User001', '2014-11-08')
('User001', '2014-11-08') second occurrence
('User001', '2014-11-08')
('User001', '2014-11-15')
('User001', '2014-11-15') second occurrence
('User001', '2014-11-15')
('User001', '2014-11-22')
('User001', '2014-11-22') second occurrence
('User001', '2014-11-22')
('User008', '2014-11-22')
('User008', '2014-11-22') second occurrence
('User008', '2014-11-22')
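
The Counter idea can also be wrapped in a small generator so that the loop body only has to look at an occurrence number. This is a sketch, not part of the answer above: `with_occurrence` is a hypothetical helper, and the rows are made-up plain tuples (with pandas they could come from `df.itertuples(index=False, name=None)`).

```python
from collections import Counter


def with_occurrence(rows):
    """Yield (row, n), where n counts how often row has appeared so far, this time included."""
    seen = Counter()
    for row in rows:
        seen[row] += 1
        yield row, seen[row]


# Hypothetical rows standing in for the DataFrame contents
rows = [
    ("User001", "2014-11-01"),
    ("User001", "2014-11-01"),
    ("User001", "2014-11-01"),
    ("User001", "2014-11-08"),
    ("User001", "2014-11-08"),
]

for row, n in with_occurrence(rows):
    if n == 2:
        print(row, "second occurrence")
    else:
        print(row)
```

Unlike the groupby-based answers, this makes a single pass and does not need the rows of a group to be adjacent, but each row has to be hashable.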