2014-05-25 280 views

Python - aggregate by month and calculate the average

I have a CSV file that looks like this:

Date,Sentiment 
2014-01-03,0.4 
2014-01-04,-0.03 
2014-01-09,0.0 
2014-01-10,0.07 
2014-01-12,0.0 
2014-02-24,0.0 
2014-02-25,0.0 
2014-02-25,0.0 
2014-02-26,0.0 
2014-02-28,0.0 
2014-03-01,0.1 
2014-03-02,-0.5 
2014-03-03,0.0 
2014-03-08,-0.06 
2014-03-11,-0.13 
2014-03-22,0.0 
2014-03-23,0.33 
2014-03-23,0.3 
2014-03-25,-0.14 
2014-03-28,-0.25 
etc 

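My goal is to aggregate the dates by month and calculate the monthly average. The dates might not start on the 1st or in January. The problem is that I have a lot of data, which means it covers several years. So I want to find the earliest date (month) and start counting months and computing averages from there. For example: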

Month count, average 
1, 0.4 (<= the earliest month) 
2, -0.3 
3, 0.0 
... 
12, 0.1 
13, -0.4 (<= new year but counting of month is continuing) 
14, 0.3 

I use pandas to open the CSV:

data = pd.read_csv("pks.csv", sep=",") 

So data['Date'] holds the dates and data['Sentiment'] holds the values. Any idea how to do this?

Answer


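Probably the simplest way is to use the resample method. First, when reading the data, make sure to parse the dates and set the date column as the index (ignore the StringIO part; I am reading the sample data from a multi-line string. header=0 says that the first row holds the column names; current pandas rejects a boolean such as header=True):

>>> df = pd.read_csv(StringIO(data), header=0, parse_dates=['Date'], 
        index_col='Date') 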
>>> df 

      Sentiment 
Date 
2014-01-03  0.40 
2014-01-04  -0.03 
2014-01-09  0.00 
2014-01-10  0.07 
2014-01-12  0.00 
2014-02-24  0.00 
2014-02-25  0.00 
2014-02-25  0.00 
2014-02-26  0.00 
2014-02-28  0.00 
2014-03-01  0.10 
2014-03-02  -0.50 
2014-03-03  0.00 
2014-03-08  -0.06 
2014-03-11  -0.13 
2014-03-22  0.00 
2014-03-23  0.33 
2014-03-23  0.30 
2014-03-25  -0.14 
2014-03-28  -0.25 


>>> df.resample('M').mean()  # use 'ME' instead of 'M' on pandas 2.2+ 

      Sentiment 
2014-01-31  0.088 
2014-02-28  0.000 
2014-03-31  -0.035 
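
For a self-contained version of the resample approach, here is a minimal sketch built from the sample rows in the question. It assumes a reasonably recent pandas, where the aggregation is called as a method on the resampler. The names CSV_TEXT and monthly are only illustrative. It passes pd.offsets.MonthEnd() rather than a frequency string, because the month-end alias is spelled 'M' in older pandas and 'ME' in newer releases.

```python
from io import StringIO

import pandas as pd

# Sample rows copied from the question; a real run would read "pks.csv" instead.
CSV_TEXT = """Date,Sentiment
2014-01-03,0.4
2014-01-04,-0.03
2014-01-09,0.0
2014-01-10,0.07
2014-01-12,0.0
2014-02-24,0.0
2014-02-25,0.0
2014-02-25,0.0
2014-02-26,0.0
2014-02-28,0.0
2014-03-01,0.1
2014-03-02,-0.5
2014-03-03,0.0
2014-03-08,-0.06
2014-03-11,-0.13
2014-03-22,0.0
2014-03-23,0.33
2014-03-23,0.3
2014-03-25,-0.14
2014-03-28,-0.25
"""

# Parse the Date column as datetimes and use it as the index,
# because resample needs a datetime-like index.
df = pd.read_csv(StringIO(CSV_TEXT), parse_dates=["Date"], index_col="Date")

# One row per calendar month, labelled with the month-end date.
monthly = df.resample(pd.offsets.MonthEnd()).mean()
print(monthly.round(3))
```

Note that resample also emits a row (with NaN) for any month inside the date range that has no data.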

If you want a month counter, you can add it after the resample:

>>> agg = df.resample('M').mean() 
>>> agg['cnt'] = range(len(agg)) 
>>> agg 

      Sentiment cnt 
2014-01-31  0.088 0 
2014-02-28  0.000 1 
2014-03-31  -0.035 2 
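
If the counter should start at 1 for the earliest month and keep running across year boundaries (12, 13, 14, ...), as in the question's example, one possible sketch is to compute the month number directly from the earliest date. The data below is hypothetical and only chosen to cross a year boundary. The names first, month_no and result are illustrative.

```python
import pandas as pd

# Hypothetical data spanning a year boundary (not taken from the question).
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2013-11-05", "2013-11-20", "2013-12-01", "2014-01-15", "2014-11-02"]
        ),
        "Sentiment": [0.5, 0.25, -0.5, 0.0, 1.0],
    }
)

# The earliest date defines month number 1.
first = df["Date"].min()

# Whole months elapsed since the earliest month, shifted to start at 1.
month_no = (
    (df["Date"].dt.year - first.year) * 12
    + (df["Date"].dt.month - first.month)
    + 1
)

# Average the values that share a month number.
result = df.groupby(month_no.rename("Month count"))["Sentiment"].mean()
print(result)
```

Here November 2013 becomes month 1 and November 2014 becomes month 13. Unlike resample, months without any rows simply do not appear in the result, although the numbering still accounts for them.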

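You can also do this with the groupby method and pd.Grouper (called TimeGrouper in older pandas versions): group by month, then call the mean convenience method that is available on groupby.

>>> df.groupby(pd.Grouper(freq='M')).mean() 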

      Sentiment 
2014-01-31  0.088 
2014-02-28  0.000 
2014-03-31  -0.035 
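
The groupby route also works without setting the index first, which matches how the question loads the file. The following is a sketch under the assumption that the Date column has already been converted to datetimes. It uses a handful of rows from the question, and the names data and by_month are illustrative.

```python
import pandas as pd

# A few rows from the question's sample, with Date kept as a regular column.
data = pd.DataFrame(
    {
        "Date": ["2014-01-03", "2014-01-04", "2014-02-24", "2014-03-01", "2014-03-02"],
        "Sentiment": [0.4, -0.03, 0.0, 0.1, -0.5],
    }
)

# Grouping by time needs real datetimes, not strings.
data["Date"] = pd.to_datetime(data["Date"])

# key= names the column to group on; the MonthEnd offset object stands in for
# the month-end alias ('M' in older pandas, 'ME' in newer releases).
by_month = data.groupby(pd.Grouper(key="Date", freq=pd.offsets.MonthEnd()))[
    "Sentiment"
].mean()
print(by_month.round(3))
```

The result is a Series indexed by month-end dates, in the same layout as the resample output.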

Awesome, that is exactly what I needed. Thank you very much!