pandas read_csv() input local datetime strings, tz_convert to UTC

I'm using pandas-0.8rc2 to read an input CSV with two columns of localized datetime strings that lack UTC offset information, and I need the dataframe series properly converted to UTC.

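I've been trying workarounds to get past the fact that the timestamp columns are not the index; they are data. tz_localize and tz_convert apparently only work on the index of a Series/DataFrame, not on its columns. I'd very much like to learn a better way to do this than the following code: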

# test.py 
import pandas 

# input.csv: 
# starting,ending,measure 
# 2012-06-21 00:00,2012-06-23 07:00,77 
# 2012-06-23 07:00,2012-06-23 16:30,65 
# 2012-06-23 16:30,2012-06-25 08:00,77 
# 2012-06-25 08:00,2012-06-26 12:00,0 
# 2012-06-26 12:00,2012-06-27 08:00,77 

df = pandas.read_csv('input.csv', parse_dates=[0,1]) 
print df 

ser_starting = df.starting 
ser_starting.index = ser_starting.values 
ser_starting = ser_starting.tz_localize('US/Eastern') 
ser_starting = ser_starting.tz_convert('UTC') 

ser_ending = df.ending 
ser_ending.index = ser_ending.values 
ser_ending = ser_ending.tz_localize('US/Eastern') 
ser_ending = ser_ending.tz_convert('UTC') 

df.starting = ser_starting.index 
print df 
df.ending = ser_ending.index 
print df

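Secondly, the code runs into some strange behavior. The timestamp data is altered on the second assignment back to the dataframe, regardless of whether df.starting or df.ending is assigned first: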

$ python test.py 
       starting    ending measure 
0 2012-06-21 00:00:00 2012-06-23 07:00:00  77 
1 2012-06-23 07:00:00 2012-06-23 16:30:00  65 
2 2012-06-23 16:30:00 2012-06-25 08:00:00  77 
3 2012-06-25 08:00:00 2012-06-26 12:00:00  0 
4 2012-06-26 12:00:00 2012-06-27 08:00:00  77 
      starting    ending measure 
0 2012-06-21 04:00:00 2012-06-23 07:00:00  77 
1 2012-06-23 11:00:00 2012-06-23 16:30:00  65 
2 2012-06-23 20:30:00 2012-06-25 08:00:00  77 
3 2012-06-25 12:00:00 2012-06-26 12:00:00  0 
4 2012-06-26 16:00:00 2012-06-27 08:00:00  77 
Traceback (most recent call last): 
    File "test.py", line 28, in <module> 
    print df 
    File "/path/to/lib/python2.7/site-packages/pandas/core/frame.py", line 572, in __repr__ 
    if self._need_info_repr_(): 
    File "/path/to/lib/python2.7/site-packages/pandas/core/frame.py", line 560, in _need_info_repr_ 
    self.to_string(buf=buf) 
    File "/path/to/lib/python2.7/site-packages/pandas/core/frame.py", line 1207, in to_string 
    formatter.to_string(force_unicode=force_unicode) 
    File "/path/to/lib/python2.7/site-packages/pandas/core/format.py", line 200, in to_string 
    fmt_values = self._format_col(i) 
    File "/path/to/lib/python2.7/site-packages/pandas/core/format.py", line 242, in _format_col 
    space=self.col_space) 
    File "/path/to/lib/python2.7/site-packages/pandas/core/format.py", line 462, in format_array 
    return fmt_obj.get_result() 
    File "/path/to/lib/python2.7/site-packages/pandas/core/format.py", line 589, in get_result 
    fmt_values = [formatter(x) for x in self.values] 
    File "/path/to/lib/python2.7/site-packages/pandas/core/format.py", line 597, in _format_datetime64 
    base = stamp.strftime('%Y-%m-%d %H:%M:%S') 
ValueError: year=1768 is before 1900; the datetime strftime() methods require year >= 1900

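The print statements are only there to illustrate the problem. The incorrect values are still produced, just without an exception, if I avoid repr and other methods that call strftime.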

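Curiously, if I keep repeating the df.{starting,ending} assignments at the REPL, I usually end up with a correct dataframe with the proper timestamps: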

In [151]: df 
Out[151]: 
      starting    ending measure 
0 2012-06-21 04:00:00 2012-06-23 11:00:00 77 
1 2012-06-23 11:00:00 2012-06-23 20:30:00 65 
2 2012-06-23 20:30:00 2012-06-25 12:00:00 77 
3 2012-06-25 12:00:00 2012-06-26 16:00:00 0 
4 2012-06-26 16:00:00 2012-06-27 12:00:00 77

This is not repeatable, AFAICT; I can't describe the precise sequence of calls that gets past the ValueError above, but it does happen.

I'd appreciate any ideas on whether I'm dealing with a bug, or whether this is unsupported use of the API.

As mentioned above, I'd prefer to simply learn a better way of using the pandas API so that I can avoid doing this.
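For comparison, here is a minimal sketch of what a column-wise conversion could look like. It assumes a considerably newer pandas release than the 0.8rc2 used above, one in which datetime columns expose the `.dt` accessor, so it is not something that would run on the version in the question. The helper name `localize_columns_to_utc` and the inlined `CSV_TEXT` are made up for illustration, and it says nothing about whether the behavior described above is a bug.

```python
import io

import pandas as pd

# Same rows as input.csv in the question, inlined so the sketch is self-contained.
CSV_TEXT = """starting,ending,measure
2012-06-21 00:00,2012-06-23 07:00,77
2012-06-23 07:00,2012-06-23 16:30,65
2012-06-23 16:30,2012-06-25 08:00,77
2012-06-25 08:00,2012-06-26 12:00,0
2012-06-26 12:00,2012-06-27 08:00,77
"""


def localize_columns_to_utc(frame, columns, local_tz="US/Eastern"):
    """Return a copy of frame with the naive datetime columns converted to UTC.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    result = frame.copy()
    for name in columns:
        # .dt exposes the timezone methods on a column, so the values never
        # have to be moved into the index and back.
        result[name] = result[name].dt.tz_localize(local_tz).dt.tz_convert("UTC")
    return result


# Assumption: parse_dates accepts column names here, as in recent pandas releases.
df = pd.read_csv(io.StringIO(CSV_TEXT), parse_dates=["starting", "ending"])
df_utc = localize_columns_to_utc(df, ["starting", "ending"])
print(df_utc)
```

Because the helper works on a copy, the original frame keeps its naive local timestamps, and both columns are converted in a single pass rather than through two separate assignments back into the frame.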

Source

2012-06-24 Jeff Kowalczyk