2014-01-09
2

pandas: how to get the monthly mean using groupby

I have the following DataFrame:

data=pd.read_csv('anual.csv', parse_dates='Fecha', index_col=0) 
data 

DatetimeIndex: 290 entries, 2011-01-01 00:00:00 to 2011-12-31 00:00:00 
Data columns (total 12 columns): 
HR    290 non-null values 
PreciAcu  290 non-null values 
RadSolar  290 non-null values 
T    290 non-null values 
Presion  290 non-null values 
Tmax   290 non-null values 
HRmax   290 non-null values 
Presionmax  290 non-null values 
RadSolarmax 290 non-null values 
Tmin   290 non-null values 
HRmin   290 non-null values 
Presionmin  290 non-null values 
dtypes: float64(4), int64(8) 

where:

data['HR'] 

Fecha 
2011-01-01 37 
2011-02-01 70 
2011-03-01 62 
2011-04-01 69 
2011-05-01 72 
2011-06-01 71 
2011-07-01 71 
2011-08-01 70 
2011-09-01 40 
... 
2011-12-17 92 
2011-12-18 78 
2011-12-19 79 
2011-12-20 76 
2011-12-21 78 
2011-12-22 80 
2011-12-23 72 
2011-12-24 70 

Also, some months are not complete. My goal is to compute the monthly mean from the daily data. I did this as follows:

monthly=data.resample('M', how='mean') 

                   HR  PreciAcu    RadSolar          T  Presion       Tmax
Fecha
2011-01-31  68.586207  3.744828  163.379310  17.496552        0  25.875862
2011-02-28  68.666667  1.966667  208.000000  18.854167        0  28.879167
2011-03-31  69.136364  3.495455  218.090909  20.986364        0  30.359091
2011-04-30  68.956522  1.913043  221.130435  22.165217        0  31.708696
2011-05-31  72.700000  0.550000  201.100000  18.900000        0  27.460000
2011-06-30  70.821429  6.050000  214.000000  23.032143        0  30.621429
2011-07-31  78.034483  5.810345  188.206897  21.503448        0  27.951724
2011-08-31  71.750000  1.028571  214.750000  22.439286        0  30.657143
2011-09-30  72.481481  0.185185  196.962963  21.714815        0  29.596296
2011-10-31  68.083333  1.770833  224.958333  18.683333        0  27.075000
2011-11-30  71.750000  0.812500  169.625000  18.925000        0  26.237500
2011-12-31  71.833333  0.160000  159.533333  17.260000        0  25.403333

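The first error I notice is in the precipitation column: all the observations for the month are 0, yet the mean obtained for that particular month is 3.74.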

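When I compute the averages in Excel and compare them with the results above, there are significant differences. For example, the mean HR for February (Febrero) is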

mean HR using pandas=68.66
mean HR using excel=67

Another detail I noticed:

data['PreciAcu']['2011-01'].count() 

    29 and should be 31 

Am I doing something wrong? How can I fix this error?

Attached CSV file:

[link] https://www.dropbox.com/s/p5hl137bqm82j41/anual.csv

+0

You may need to post the csv file to get an answer to this. – TomAugspurger

+0

[link] https://www.dropbox.com/s/p5hl137bqm82j41/anual.csv – user1345283

Answer

4

Your date column is being misinterpreted because it is in DD/MM/YYYY format. Set dayfirst=True instead:

>>> df = pd.read_csv('anual.csv', parse_dates=['Fecha'], dayfirst=True, index_col=0, sep=r"\s+")
>>> df['PreciAcu']['2011-01'].count() 
31 
>>> df.resample("M").mean()
                   HR    PreciAcu    RadSolar          T  Presion       Tmax  \
Fecha
2011-01-31  68.774194    0.000000  162.354839  16.535484        0  25.393548
2011-02-28  67.000000    0.000000  193.481481  15.418519        0  25.696296
2011-03-31  59.083333    0.850000  254.541667  21.295833        0  32.325000
2011-04-30  61.200000    1.312000  260.640000  24.676000        0  34.760000
2011-05-31        NaN         NaN         NaN        NaN      NaN        NaN
2011-06-30  68.428571    8.576190  236.619048  25.009524        0  32.028571
2011-07-31  81.518519   11.488889  185.407407  22.429630        0  27.681481
2011-08-31  76.451613    0.677419  219.645161  23.677419        0  30.719355
2011-09-30  77.533333    2.883333  196.100000  21.573333        0  28.723333
2011-10-31  73.120000    1.260000  196.280000  19.552000        0  27.636000
2011-11-30  71.277778  -79.333333  148.555556  18.250000        0  26.511111
2011-12-31  73.741935    0.067742  134.677419  15.687097        0  24.019355

                HRmax  Presionmax       Tmin
Fecha
2011-01-31  92.709677           0  10.909677
2011-02-28  92.111111           0   8.325926
2011-03-31  89.291667           0  13.037500
2011-04-30  89.400000           0  17.328000
2011-05-31        NaN         NaN        NaN
2011-06-30  92.095238           0  19.761905
2011-07-31  97.185185           0  18.774074
2011-08-31  96.903226           0  18.670968
2011-09-30  97.200000           0  16.373333
2011-10-31  97.000000           0  13.412000
2011-11-30  94.555556           0  11.877778
2011-12-31  94.161290           0  10.070968

[12 rows x 9 columns] 
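
To make the effect of day-first parsing concrete, here is a minimal sketch on a tiny made-up sample (the four rows below are hypothetical, not taken from anual.csv). It shows how one ambiguous date string is read both ways, then computes the monthly mean by grouping on the monthly period of the index. Note that `resample(..., how='mean')` as written in the question was removed in later pandas releases, so the sketch uses `groupby` instead; treat it as one possible approach rather than the only one.

```python
import io
import pandas as pd

# Hypothetical day-first sample (DD/MM/YYYY); NOT the real anual.csv.
raw = io.StringIO(
    "Fecha HR\n"
    "01/01/2011 60\n"
    "02/01/2011 70\n"
    "01/02/2011 80\n"
    "02/02/2011 90\n"
)
df = pd.read_csv(raw, sep=r"\s+")

# The same string read month-first (default) and day-first.
month_first = pd.to_datetime("01/02/2011")               # 2 January 2011
day_first = pd.to_datetime("01/02/2011", dayfirst=True)  # 1 February 2011

# Parse the whole column day-first and use it as the index.
df["Fecha"] = pd.to_datetime(df["Fecha"], dayfirst=True)
df = df.set_index("Fecha")

# Monthly mean: group the daily rows by the month they fall in.
monthly = df.groupby(df.index.to_period("M")).mean()
print(monthly)
```

With the dates read day-first, the two January rows and the two February rows land in their own months, which is why the January count in the answer goes from 29 to 31.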

(Note, though, and I had forgotten this, that dayfirst=True is not strict; see here. It may be safer to use date_parser.)
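
One way to get strict behaviour is to state the format explicitly. The sketch below swaps in `pd.to_datetime` with `format="%d/%m/%Y"` rather than the `date_parser` argument mentioned above, and it uses made-up date strings, so read it as an illustration under those assumptions. With an explicit format, a string that does not fit DD/MM/YYYY raises an error instead of being quietly reinterpreted.

```python
import pandas as pd

# Hypothetical day-first strings, for illustration only.
dates = pd.Series(["31/01/2011", "01/02/2011"])

# An explicit format leaves no room for guessing the field order.
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

# A month-first string does not fit the format: "31" is not a valid month.
try:
    pd.to_datetime(pd.Series(["01/31/2011"]), format="%d/%m/%Y")
    strict_failed = False
except ValueError:
    strict_failed = True

print(parsed.dt.strftime("%Y-%m-%d").tolist(), strict_failed)
```

The same idea could be applied to the real file by reading the Fecha column as text and converting it afterwards, assuming every row really is in DD/MM/YYYY format.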

+0

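Thanks, with your suggestion I achieved my goal. Now, how can I change the format of the date column? Instead of 2011-01-31, I would like to get January, etc. – user1345283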

+0

@user1345283: If you have another question, please open a new question. StackOverflow is a question/answer format, not a thread. – DSM