Counting observations after grouping by dates in pandas

What is the best way to count observations by date in a pandas DataFrame when the timestamps are non-unique?

import datetime
import numpy as np
import pandas as pd
df = pd.DataFrame({'User' : ['A', 'B', 'C'] * 40, 
        'Value' : np.random.randn(120), 
        'Time' : [np.random.choice(pd.date_range(datetime.datetime(2013,1,1,0,0,0),datetime.datetime(2013,1,3,0,0,0),freq='H')) for i in range(120)]})

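Ideally, the output would give the number of observations per day (or some other higher-order unit of time). This could then be used to plot activity over time.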

2013-01-01  60 
2013-01-02  60

Source

2014-01-24 Brian Keegan

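The "un-pandas-ic" way of doing this would be to use a Counter object on the series of datetimes converted to dates, convert this counter back to a series, and coerce the index of this series to datetimes.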

In[1]: from collections import Counter 
In[2]: counted_dates = Counter(df['Time'].apply(lambda x: x.date())) 
In[3]: counted_series = pd.Series(counted_dates) 
In[4]: counted_series.index = pd.to_datetime(counted_series.index) 
In[5]: counted_series 
Out[5]: 
2013-01-01  60 
2013-01-02  60

A more "pandas-ic" way would be to use a groupby operation on the series and then aggregate the output by length.

In[1]: grouped_dates = df.groupby(df['Time'].apply(lambda x : x.date())) 
In[2]: grouped_dates['Time'].aggregate(len) 
Out[2]: 
2013-01-01  60 
2013-01-02  60
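For readers on a recent pandas release, the same grouping can be sketched without a Python-level lambda by using the `.dt` accessor. This is a minimal sketch, not part of the original post; the small frame `demo` is hypothetical data chosen so the counts are easy to check by eye.

```python
import pandas as pd

# Hypothetical data: five observations with repeated timestamps over two days.
demo = pd.DataFrame({
    'User': ['A', 'B', 'C', 'A', 'B'],
    'Time': pd.to_datetime([
        '2013-01-01 00:00', '2013-01-01 00:00', '2013-01-01 05:00',
        '2013-01-02 03:00', '2013-01-02 03:00',
    ]),
})

# Group on the calendar date of each timestamp and count the rows per group.
daily_counts = demo.groupby(demo['Time'].dt.date).size()
print(daily_counts)
```

Here `size()` counts every row in a group, so repeated timestamps are counted once per occurrence.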

EDIT: Another highly concise possibility, borrowed from here, is to use the nunique method:

In[1]: df.groupby(df['Time'].apply(lambda x : x.date())).agg({'Time':pd.Series.nunique}) 
Out[1]: 
2013-01-01  60 
2013-01-02  60
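One caveat worth checking before relying on nunique: it counts distinct values, so when several rows share a timestamp it can report fewer than the number of observations. The sketch below compares the two on a small hypothetical frame (`demo` is made up for illustration and is not the data from the question).

```python
import pandas as pd

# Hypothetical data: two rows share 2013-01-01 00:00, two share 2013-01-02 03:00.
demo = pd.DataFrame({
    'Time': pd.to_datetime([
        '2013-01-01 00:00', '2013-01-01 00:00', '2013-01-01 05:00',
        '2013-01-02 03:00', '2013-01-02 03:00',
    ]),
})

by_day = demo.groupby(demo['Time'].dt.date)['Time']

rows_per_day = by_day.size()         # every row counts
distinct_per_day = by_day.nunique()  # repeated timestamps count once

print(rows_per_day.tolist(), distinct_per_day.tolist())
```

If the goal is the number of observations per day, the row count is the quantity to use; the two agree only when timestamps within a day never repeat.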

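Beyond stylistic differences, does one have a significant performance advantage over the others? Are there other built-in methods I have overlooked?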

Source

2014-01-24 22:08:58

Edit: another solution, which is faster, is to use value_counts (and normalize):

In [41]: %timeit df1 = df.set_index('Time'); pd.value_counts(df1.index.normalize(), sort=False) 
1000 loops, best of 3: 586 µs per loop

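I had thought this would be more concisely written as a resample, if you use a DatetimeIndex.
However, it seems to be much slower and, surprisingly, the Counter solution is the fastest!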

In [11]: df1 = df.set_index('Time') 

In [12]: df1.User.resample('D', how=len) 
Out[12]: 
Time 
2013-01-01 59 
2013-01-02 58 
2013-01-03  3 
Freq: D, Name: User, dtype: int64
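Recent pandas releases no longer accept the `how=` keyword on resample, so readers on a current version may need a different spelling. A minimal sketch of one equivalent, assuming a DatetimeIndex and using hypothetical data (`demo`) with a deliberate one-day gap:

```python
import pandas as pd

# Hypothetical data with no observations on 2013-01-03.
demo = pd.DataFrame({
    'User': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Time': pd.to_datetime([
        '2013-01-01 00:00', '2013-01-01 00:00', '2013-01-01 05:00',
        '2013-01-02 03:00', '2013-01-02 03:00', '2013-01-04 12:00',
    ]),
})

# Resample the time-indexed column into daily bins and count rows per bin.
daily_counts = demo.set_index('Time')['User'].resample('D').size()
print(daily_counts)
```

Unlike the groupby version, resample emits a row for every day in the range, so days without observations appear with a count of 0. That can be convenient when plotting activity over time.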

It's always worth checking out some timeits for these:

In [21]: %timeit df1.User.resample('D', how=len) 
1000 loops, best of 3: 720 µs per loop

Unfortunately, the set_index makes this more expensive:

In [22]: %timeit df1 = df.set_index('Time'); df1.User.resample('D', how=len) 
1000 loops, best of 3: 1.1 ms per loop

Comparing:

In [23]: %%timeit 
    ....: grouped_dates = df.groupby(df['Time'].apply(lambda x : x.date())) 
    ....: grouped_dates['Time'].aggregate(len) 
    ....: 
1000 loops, best of 3: 788 µs per loop 

In [24]: %%timeit 
    ....: counted_dates = Counter(df['Time'].apply(lambda x: x.date())) 
    ....: counted_series = pd.Series(counted_dates) 
    ....: counted_series.index = pd.to_datetime(counted_series.index) 
    ....: 
1000 loops, best of 3: 568 µs per loop
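Outside IPython, a comparison like the one above can be sketched with the standard-library timeit module. The helper names (`with_groupby`, `with_counter`, `with_value_counts`) and the way the random timestamps are generated are assumptions made for this sketch; absolute numbers will vary with the machine and the pandas version.

```python
import timeit
from collections import Counter

import numpy as np
import pandas as pd

# 120 random hourly timestamps within a two-day window (hypothetical data).
rng = np.random.default_rng(0)
offsets = rng.integers(0, 48, size=120)
df = pd.DataFrame({
    'Time': pd.Timestamp('2013-01-01') + pd.to_timedelta(offsets, unit='h'),
})

def with_groupby():
    return df.groupby(df['Time'].dt.date).size()

def with_counter():
    return pd.Series(Counter(df['Time'].dt.date))

def with_value_counts():
    return df['Time'].dt.normalize().value_counts(sort=False)

runs = 20
timings = {
    func.__name__: timeit.timeit(func, number=runs)
    for func in (with_groupby, with_counter, with_value_counts)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds / runs * 1e3:.3f} ms per call")
```

All three functions should return the same daily totals, which is worth asserting before trusting any timing.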

I suspected that with more dates it would be different...

In [31]: df = pd.DataFrame({'User' : ['A', 'B', 'C'] * 400, 
        'Value' : np.random.randn(1200), 
        'Time' : [np.random.choice(pd.date_range(datetime.datetime(1992,1,1,0,0,0),datetime.datetime(2014,1,1,0,0,0),freq='H')) for i in range(1200)]}) 

In [32]: %timeit df1 = df.set_index('Time'); df1.User.resample('D', how=len) 
10 loops, best of 3: 28.7 ms per loop 

In [33]: %%timeit     
    ....: grouped_dates = df.groupby(df['Time'].apply(lambda x : x.date())) 
    ....: grouped_dates['Time'].aggregate(len) 
    ....: 
100 loops, best of 3: 6.82 ms per loop 

In [34]: %%timeit     
    ....: counted_dates = Counter(df['Time'].apply(lambda x: x.date())) 
    ....: counted_series = pd.Series(counted_dates) 
    ....: counted_series.index = pd.to_datetime(counted_series.index) 
    ....: 
100 loops, best of 3: 3.04 ms per loop

But Counter still wins...!

Edit: but it is smashed by value_counts:

In [42]: %timeit df1 = df.set_index('Time'); pd.value_counts(df1.index.normalize(), sort=False) 
1000 loops, best of 3: 989 µs per loop
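Newer pandas releases warn about the top-level `pd.value_counts` function, so the same idea may be easier to keep working as a Series method. A minimal sketch on hypothetical data (`demo`), without the set_index step:

```python
import pandas as pd

# Hypothetical data: three observations on the first day, two on the second.
demo = pd.DataFrame({
    'Time': pd.to_datetime([
        '2013-01-01 00:00', '2013-01-01 00:00', '2013-01-01 05:00',
        '2013-01-02 03:00', '2013-01-02 03:00',
    ]),
})

# normalize() floors each timestamp to midnight; value_counts() then tallies
# how many rows fall on each day.
daily_counts = demo['Time'].dt.normalize().value_counts(sort=False)
print(daily_counts.sort_index())
```

The result is indexed by midnight timestamps rather than `datetime.date` objects, so no `pd.to_datetime` conversion of the index is needed afterwards.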

Source

2014-01-24 22:44:30

len(Series.unique()) might be even faster.

On my PC:

%timeit df1 = df.set_index('Time'); pd.value_counts(df1.index.normalize(), sort=False) 
1000 loops, best of 3: 2.06 ms per loop

while

%timeit df1 = df.set_index('Time'); len(df1.index.normalize().unique()) 
1000 loops, best of 3: 1.04 ms per loop

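Interestingly, len(Series.unique()) is usually much faster than Series.nunique(). For small arrays with up to x000 items it is 10-15 times faster; for large arrays with millions of items it is 3-4 times faster.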
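To make clear what these two expressions compute, here is a small sketch on hypothetical data. Both return the number of distinct days, which is a single number rather than a count of observations per day.

```python
import pandas as pd

# Hypothetical timestamps spanning two calendar days.
times = pd.Series(pd.to_datetime([
    '2013-01-01 00:00', '2013-01-01 00:00', '2013-01-01 05:00',
    '2013-01-02 03:00', '2013-01-02 03:00',
]))

days = times.dt.normalize()

n_days_via_unique = len(days.unique())   # length of the array of distinct days
n_days_via_nunique = days.nunique()      # built-in distinct count

print(n_days_via_unique, n_days_via_nunique)
```

One difference to keep in mind: `nunique()` skips missing values by default, whereas `unique()` keeps them, so the two can disagree by one when the data contain NaT.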

Source

2014-05-02 23:10:52
