2015-11-30 559 views

Count duplicate values in a Pandas DataFrame

There must be an easy way to do this, but I could not find an elegant solution on SO or work one out myself.

I want to count the number of duplicate rows based on a set of columns in a DataFrame.

Example:

print(df)

     Month  LSOA code  Longitude   Latitude             Crime type
0  2015-01  E01000916  -0.106453  51.518207          Bicycle theft
1  2015-01  E01000914  -0.111497  51.518226               Burglary
2  2015-01  E01000914  -0.111497  51.518226               Burglary
3  2015-01  E01000914  -0.111497  51.518226            Other theft
4  2015-01  E01000914  -0.113767  51.517372  Theft from the person

My workaround:

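counts = {}
for i, row in df.iterrows():
    # Key each row on the columns that define a duplicate
    key = (
        row['Longitude'],
        row['Latitude'],
        row['Crime type'],
    )

    # dict.has_key() was removed in Python 3; use the `in` operator
    if key in counts:
        counts[key] = counts[key] + 1
    else:
        counts[key] = 1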

And I get the counts:

{(-0.11376700000000001, 51.517371999999995, 'Theft from the person'): 1, 
(-0.111497, 51.518226, 'Burglary'): 2, 
(-0.111497, 51.518226, 'Other theft'): 1, 
(-0.10645299999999999, 51.518207000000004, 'Bicycle theft'): 1} 
Aside from the fact that this code could probably be improved as well (feel free to comment on how), what is the way to do this with pandas?

For those interested, I am working with a dataset from https://data.police.uk/
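
On the "how could the loop be improved" part, one possible tidy-up is sketched here. It is only an illustration, not a pandas-native answer: it keeps the same idea of counting tuples built from the three columns, but lets collections.Counter do the bookkeeping. The DataFrame is rebuilt by hand from the five sample rows, and the variable name cols is introduced for this sketch.

```python
from collections import Counter

import pandas as pd

# Sample rows copied by hand from the example in the question
df = pd.DataFrame({
    'Month': ['2015-01'] * 5,
    'LSOA code': ['E01000916', 'E01000914', 'E01000914',
                  'E01000914', 'E01000914'],
    'Longitude': [-0.106453, -0.111497, -0.111497, -0.111497, -0.113767],
    'Latitude': [51.518207, 51.518226, 51.518226, 51.518226, 51.517372],
    'Crime type': ['Bicycle theft', 'Burglary', 'Burglary',
                   'Other theft', 'Theft from the person'],
})

# Columns that together define a "duplicate" (name chosen for this sketch)
cols = ['Longitude', 'Latitude', 'Crime type']

# itertuples(index=False, name=None) yields one plain tuple per row,
# which is hashable and can be counted directly
counts = Counter(df[cols].itertuples(index=False, name=None))

# Look up how often one particular combination occurs
print(counts[(-0.111497, 51.518226, 'Burglary')])
```

Counter is a dict subclass, so counts can be used wherever the hand-built dictionary was used.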

Answer


You can use groupby with the size function. I then reset the index and rename the resulting column from 0 to count.

print(df)
     Month  LSOA code  Longitude   Latitude             Crime type
0  2015-01  E01000916  -0.106453  51.518207          Bicycle theft
1  2015-01  E01000914  -0.111497  51.518226               Burglary
2  2015-01  E01000914  -0.111497  51.518226               Burglary
3  2015-01  E01000914  -0.111497  51.518226            Other theft
4  2015-01  E01000914  -0.113767  51.517372  Theft from the person

df = df.groupby(['Longitude', 'Latitude', 'Crime type']).size().reset_index(name='count')
print(df)
   Longitude   Latitude             Crime type  count
0  -0.113767  51.517372  Theft from the person      1
1  -0.111497  51.518226               Burglary      2
2  -0.111497  51.518226            Other theft      1
3  -0.106453  51.518207          Bicycle theft      1

print(df['count'])
0    1
1    2
2    1
3    1
Name: count, dtype: int64
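
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the groupby/size approach above. It rebuilds the five sample rows by hand and stores the results under new names (by_group, by_value, dupes, all chosen for this sketch) instead of overwriting df. It also shows DataFrame.value_counts with a column subset as a possible alternative; that call assumes pandas 1.1 or later, and it orders rows by count (largest first) rather than by key.

```python
import pandas as pd

# Sample rows copied by hand from the question
df = pd.DataFrame({
    'Month': ['2015-01'] * 5,
    'LSOA code': ['E01000916', 'E01000914', 'E01000914',
                  'E01000914', 'E01000914'],
    'Longitude': [-0.106453, -0.111497, -0.111497, -0.111497, -0.113767],
    'Latitude': [51.518207, 51.518226, 51.518226, 51.518226, 51.517372],
    'Crime type': ['Bicycle theft', 'Burglary', 'Burglary',
                   'Other theft', 'Theft from the person'],
})

cols = ['Longitude', 'Latitude', 'Crime type']

# Approach from the answer: group on the key columns, take each group's
# size, and name the resulting column 'count'
by_group = df.groupby(cols).size().reset_index(name='count')

# Possible alternative (assumes pandas >= 1.1): count unique combinations
# of the subset columns, sorted by count in descending order
by_value = df.value_counts(subset=cols).reset_index(name='count')

# Keep only the combinations that occur more than once
dupes = by_group[by_group['count'] > 1]

print(by_group)
print(dupes)
```

Filtering on count > 1, as in dupes, leaves only the combinations that are actually duplicated, which for the sample data is the single Burglary location.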