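pandas read_csv() input local datetime strings, tz_convert to UTC

I'm using pandas-0.8rc2 to read an input CSV with two columns of localized datetime strings that lack UTC offset information, and I need the dataframe series properly converted to UTC.
I've been trying to work around the fact that neither timestamp column represents the index; they are data. tz_localize and tz_convert apparently only operate on the index of a series/dataframe, not on its columns. I would very much like to learn a better way to do this than the following code: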
# test.py
import pandas
# input.csv:
# starting,ending,measure
# 2012-06-21 00:00,2012-06-23 07:00,77
# 2012-06-23 07:00,2012-06-23 16:30,65
# 2012-06-23 16:30,2012-06-25 08:00,77
# 2012-06-25 08:00,2012-06-26 12:00,0
# 2012-06-26 12:00,2012-06-27 08:00,77
df = pandas.read_csv('input.csv', parse_dates=[0,1])
print df
ser_starting = df.starting
ser_starting.index = ser_starting.values
ser_starting = ser_starting.tz_localize('US/Eastern')
ser_starting = ser_starting.tz_convert('UTC')
ser_ending = df.ending
ser_ending.index = ser_ending.values
ser_ending = ser_ending.tz_localize('US/Eastern')
ser_ending = ser_ending.tz_convert('UTC')
df.starting = ser_starting.index
print df
df.ending = ser_ending.index
print df
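Second, the code runs into some odd behaviour. It corrupts the timestamp data on the second assignment back to the dataframe, regardless of whether df.starting or df.ending is assigned first: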
$ python test.py
              starting               ending  measure
0  2012-06-21 00:00:00  2012-06-23 07:00:00       77
1  2012-06-23 07:00:00  2012-06-23 16:30:00       65
2  2012-06-23 16:30:00  2012-06-25 08:00:00       77
3  2012-06-25 08:00:00  2012-06-26 12:00:00        0
4  2012-06-26 12:00:00  2012-06-27 08:00:00       77
              starting               ending  measure
0  2012-06-21 04:00:00  2012-06-23 07:00:00       77
1  2012-06-23 11:00:00  2012-06-23 16:30:00       65
2  2012-06-23 20:30:00  2012-06-25 08:00:00       77
3  2012-06-25 12:00:00  2012-06-26 12:00:00        0
4  2012-06-26 16:00:00  2012-06-27 08:00:00       77
Traceback (most recent call last):
  File "test.py", line 28, in <module>
    print df
  File "/path/to/lib/python2.7/site-packages/pandas/core/frame.py", line 572, in __repr__
    if self._need_info_repr_():
  File "/path/to/lib/python2.7/site-packages/pandas/core/frame.py", line 560, in _need_info_repr_
    self.to_string(buf=buf)
  File "/path/to/lib/python2.7/site-packages/pandas/core/frame.py", line 1207, in to_string
    formatter.to_string(force_unicode=force_unicode)
  File "/path/to/lib/python2.7/site-packages/pandas/core/format.py", line 200, in to_string
    fmt_values = self._format_col(i)
  File "/path/to/lib/python2.7/site-packages/pandas/core/format.py", line 242, in _format_col
    space=self.col_space)
  File "/path/to/lib/python2.7/site-packages/pandas/core/format.py", line 462, in format_array
    return fmt_obj.get_result()
  File "/path/to/lib/python2.7/site-packages/pandas/core/format.py", line 589, in get_result
    fmt_values = [formatter(x) for x in self.values]
  File "/path/to/lib/python2.7/site-packages/pandas/core/format.py", line 597, in _format_datetime64
    base = stamp.strftime('%Y-%m-%d %H:%M:%S')
ValueError: year=1768 is before 1900; the datetime strftime() methods require year >= 1900
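The print statements are only there to illustrate the problem. The incorrect values go through without any exception if I avoid repr and other methods that invoke strftime.
Strangely, if I keep repeating the df.{starting,ending} assignments in the REPL, I usually end up with a dataframe containing correct timestamps: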
In [151]: df
Out[151]:
              starting               ending  measure
0  2012-06-21 04:00:00  2012-06-23 11:00:00       77
1  2012-06-23 11:00:00  2012-06-23 20:30:00       65
2  2012-06-23 20:30:00  2012-06-25 12:00:00       77
3  2012-06-25 12:00:00  2012-06-26 16:00:00        0
4  2012-06-26 16:00:00  2012-06-27 12:00:00       77
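This is not repeatable, AFAICT, in that I can't describe the precise sequence of the above calls that gets past the ValueError, but it does happen.
I'd appreciate any thoughts on whether I'm dealing with a bug or whether this is unsupported API usage.
As mentioned above, I'd rather just learn a better use of the pandas API so I can avoid doing this.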
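
For comparison, here is a minimal sketch of what a column-level conversion could look like. It assumes a much newer pandas than 0.8rc2 (one where Series has the `.dt` accessor with `tz_localize` and `tz_convert`) and Python 3, so it is not something the version above supports. The CSV rows are the ones from input.csv, inlined via `io.StringIO` only to keep the example self-contained.

```python
import io

import pandas as pd

# Same rows as input.csv above, inlined so the sketch runs on its own.
CSV = """starting,ending,measure
2012-06-21 00:00,2012-06-23 07:00,77
2012-06-23 07:00,2012-06-23 16:30,65
2012-06-23 16:30,2012-06-25 08:00,77
2012-06-25 08:00,2012-06-26 12:00,0
2012-06-26 12:00,2012-06-27 08:00,77
"""

df = pd.read_csv(io.StringIO(CSV), parse_dates=["starting", "ending"])

for col in ["starting", "ending"]:
    # Assumption: a pandas version with the .dt accessor.
    # Attach the local zone to the naive timestamps, then convert to UTC,
    # operating on the column directly instead of going through an index.
    df[col] = df[col].dt.tz_localize("US/Eastern").dt.tz_convert("UTC")

print(df)
```

All of the June dates fall under Eastern Daylight Time, so each timestamp should shift forward by four hours, matching the values in the REPL output above.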