2013-03-28 304 views
3

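Iterating through a pandas DataFrame

I have a pandas DataFrame in which one column indicates whether the location value in another column changes in the row below it. As an example: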

2013-02-05 19:45:00 (39.94, -86.159)  True 
2013-02-05 19:50:00 (39.94, -86.159)  True 
2013-02-05 19:55:00 (39.94, -86.159) False 
2013-02-05 20:00:00 (39.777, -85.995) False 
2013-02-05 20:05:00 (39.775, -85.978)  True 
2013-02-05 20:10:00 (39.775, -85.978)  True 
2013-02-05 20:15:00 (39.775, -85.978) False 
2013-02-05 20:20:00 (39.94, -86.159)  True 
2013-02-05 20:25:00 (39.94, -86.159) False 

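What I want to do is go through this DataFrame row by row and check for the rows with False. Then (probably by adding another column) I want to record the total "continuous" time spent at that place. As in the example above, the same place can be visited again; in that case it counts as a separate visit. So for the example above, the result would look like this: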

2013-02-05 19:45:00 (39.94, -86.159)  True 0 
2013-02-05 19:50:00 (39.94, -86.159)  True 0 
2013-02-05 19:55:00 (39.94, -86.159) False 15 
2013-02-05 20:00:00 (39.777, -85.995) False 5 
2013-02-05 20:05:00 (39.775, -85.978)  True 0 
2013-02-05 20:10:00 (39.775, -85.978)  True 0 
2013-02-05 20:15:00 (39.775, -85.978) False 15 
2013-02-05 20:20:00 (39.94, -86.159)  True 0 
2013-02-05 20:25:00 (39.94, -86.159) False 10 

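I will then plot a histogram of these "continuous" times per day using the hist() function. How can I get the second DataFrame from the first one by iterating over the DataFrame? I am new to Python and pandas, and the real data file is very large, so I need something reasonably efficient.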

Answers

7

Here's another take:

df['group'] = (df.condition == False).astype('int').cumsum().shift(1).fillna(0) 

df 
      date long  lat condition group 
2/5/2013 19:45:00 39.940 -86.159  True  0 
2/5/2013 19:50:00 39.940 -86.159  True  0 
2/5/2013 19:55:00 39.940 -86.159  False  0 
2/5/2013 20:00:00 39.777 -85.995  False  1 
2/5/2013 20:05:00 39.775 -85.978  True  2 
2/5/2013 20:10:00 39.775 -85.978  True  2 
2/5/2013 20:15:00 39.775 -85.978  False  2 
2/5/2013 20:20:00 39.940 -86.159  True  3 
2/5/2013 20:25:00 39.940 -86.159  False  3 
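
To make the grouping step easier to follow, here is a minimal sketch that breaks the one-liner into its intermediate Series. The names `ends` and `running` are illustrative and do not appear in the answer; the condition values are copied from the example.

```python
import pandas as pd

# Condition column copied from the example above.
condition = pd.Series([True, True, False, False, True, True, False, True, False])

# 1 on every row where condition is False (the last row of a visit).
ends = (~condition).astype(int)

# Running count of visits that have ended, including the current row.
running = ends.cumsum()

# Shift down by one row so the closing False row stays in the visit it ends;
# the first row has no predecessor, so it is filled with 0.
group = running.shift(1).fillna(0).astype(int)

print(pd.DataFrame({"condition": condition, "ends": ends,
                    "running": running, "group": group}))
```

Without the `shift(1)`, each False row would already carry the next visit's label, which is why the answer shifts before grouping.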

df['result'] = df.groupby(['group']).date.transform(lambda sdf: 5 *len(sdf)) 

df 
      date long  lat condition group result 
2/5/2013 19:45:00 39.940 -86.159  True  0  15 
2/5/2013 19:50:00 39.940 -86.159  True  0  15 
2/5/2013 19:55:00 39.940 -86.159  False  0  15 
2/5/2013 20:00:00 39.777 -85.995  False  1  5 
2/5/2013 20:05:00 39.775 -85.978  True  2  15 
2/5/2013 20:10:00 39.775 -85.978  True  2  15 
2/5/2013 20:15:00 39.775 -85.978  False  2  15 
2/5/2013 20:20:00 39.940 -86.159  True  3  10 
2/5/2013 20:25:00 39.940 -86.159  False  3  10 
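
The result above repeats the visit length on every row of a group, whereas the question asks for 0 on the True rows. One possible variation, sketched here under the same assumption of a fixed 5-minute sampling interval, keeps the value only on the closing False row. The frame is rebuilt with `pd.date_range` purely for illustration, and the use of `where` is an addition that is not part of the answer.

```python
import pandas as pd

# Rebuild the example: nine samples, 5 minutes apart (assumed fixed interval).
df = pd.DataFrame({
    "date": pd.date_range("2013-02-05 19:45", periods=9, freq="5min"),
    "condition": [True, True, False, False, True, True, False, True, False],
})

# Same grouping idea as in the answer: a new group starts after each False row.
df["group"] = (~df["condition"]).astype(int).cumsum().shift(1).fillna(0).astype(int)

# Rows per group times the assumed 5-minute interval.
visit_minutes = 5 * df.groupby("group")["date"].transform("count")

# Keep the total only on the False row that closes each visit, 0 elsewhere.
df["result"] = visit_minutes.where(~df["condition"], 0)

print(df)
```

If the sampling interval is not constant, the fixed factor of 5 would have to be replaced by a difference of actual timestamps within each group.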

Very nice! – John

4

You will need 0.11-dev. I think this will give you what you are looking for. See this section: http://pandas.pydata.org/pandas-docs/dev/timeseries.html#time-deltas for more information on the timedelta support in pandas.

Here's your data (I separated long/lat just for convenience; the key point is that the condition column is a boolean)

In [137]: df = pd.read_csv(StringIO.StringIO(data),index_col=0,parse_dates=True) 

In [138]: df 
Out[138]: 
       date long  lat condition 
2013-02-05 19:45:00 39.940 -86.159  True 
2013-02-05 19:50:00 39.940 -86.159  True 
2013-02-05 19:55:00 39.940 -86.159  False 
2013-02-05 20:00:00 39.777 -85.995  False 
2013-02-05 20:05:00 39.775 -85.978  True 
2013-02-05 20:10:00 39.775 -85.978  True 
2013-02-05 20:15:00 39.775 -85.978  False 
2013-02-05 20:20:00 39.940 -86.159  True 
2013-02-05 20:25:00 39.940 -86.159  False 

In [139]: df.dtypes 
Out[139]: 
long         float64
lat          float64
condition       bool
dtype: object
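
The session above does not show the `data` string and uses the Python 2 `StringIO` module. As a rough sketch of how the same frame could be loaded on Python 3, assuming a comma-separated layout (the exact file format is a guess, not something the answer states):

```python
import io

import pandas as pd

# Hypothetical CSV text; the column layout is an assumption for illustration.
data = """date,long,lat,condition
2013-02-05 19:45:00,39.940,-86.159,True
2013-02-05 19:50:00,39.940,-86.159,True
2013-02-05 19:55:00,39.940,-86.159,False
2013-02-05 20:00:00,39.777,-85.995,False
2013-02-05 20:05:00,39.775,-85.978,True
2013-02-05 20:10:00,39.775,-85.978,True
2013-02-05 20:15:00,39.775,-85.978,False
2013-02-05 20:20:00,39.940,-86.159,True
2013-02-05 20:25:00,39.940,-86.159,False
"""

# io.StringIO replaces StringIO.StringIO on Python 3.
df = pd.read_csv(io.StringIO(data), index_col=0, parse_dates=True)

print(df.dtypes)
```

The important part is that `condition` is parsed as a boolean column and the index as datetimes, since the later steps rely on both.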

Create some date columns from the index (these are datetime64[ns] dtype)

In [140]: df['date'] = df.index 
In [141]: df['rdate'] = df.index 

Set the rdate column to NaT where condition is False (np.nan is converted to NaT)

In [142]: df.loc[~df['condition'],'rdate'] = np.nan 

Forward-fill the NaT values from the previous value

In [143]: df['rdate'] = df['rdate'].ffill() 

Subtract rdate from date; this produces a timedelta64[ns] column of time differences

In [144]: df['diff'] = df['date']-df['rdate'] 

In [151]: df 
Out[151]: 
            date long lat condition    rdate \ 
2013-02-05 19:45:00 2013-02-05 19:45:00 -86.159  True 2013-02-05 19:45:00 
2013-02-05 19:50:00 2013-02-05 19:50:00 -86.159  True 2013-02-05 19:50:00 
2013-02-05 19:55:00 2013-02-05 19:55:00 -86.159  False 2013-02-05 19:50:00 
2013-02-05 20:00:00 2013-02-05 20:00:00 -85.995  False 2013-02-05 19:50:00 
2013-02-05 20:05:00 2013-02-05 20:05:00 -85.978  True 2013-02-05 20:05:00 
2013-02-05 20:10:00 2013-02-05 20:10:00 -85.978  True 2013-02-05 20:10:00 
2013-02-05 20:15:00 2013-02-05 20:15:00 -85.978  False 2013-02-05 20:10:00 
2013-02-05 20:20:00 2013-02-05 20:20:00 -86.159  True 2013-02-05 20:20:00 
2013-02-05 20:25:00 2013-02-05 20:25:00 -86.159  False 2013-02-05 20:20:00 

         diff 
2013-02-05 19:45:00 00:00:00 
2013-02-05 19:50:00 00:00:00 
2013-02-05 19:55:00 00:05:00 
2013-02-05 20:00:00 00:10:00 
2013-02-05 20:05:00 00:00:00 
2013-02-05 20:10:00 00:00:00 
2013-02-05 20:15:00 00:05:00 
2013-02-05 20:20:00 00:00:00 
2013-02-05 20:25:00 00:05:00 
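
For readers who want to reproduce these steps outside an IPython session, the following is a compact, self-contained sketch. It rebuilds the index with `pd.date_range` and uses `Series.where` instead of assigning `np.nan` through `.loc`; both choices are substitutions made for brevity, not the answer's exact code.

```python
import pandas as pd

# Rebuild the example index and condition column.
idx = pd.date_range("2013-02-05 19:45", periods=9, freq="5min")
df = pd.DataFrame(
    {"condition": [True, True, False, False, True, True, False, True, False]},
    index=idx,
)

df["date"] = df.index

# Keep the timestamp where condition is True; False rows become NaT.
df["rdate"] = df["date"].where(df["condition"])

# Carry the last True timestamp forward over the NaT rows.
df["rdate"] = df["rdate"].ffill()

# Elapsed time since the most recent True row (timedelta64[ns]).
df["diff"] = df["date"] - df["rdate"]

print(df[["condition", "rdate", "diff"]])
```

These values measure the time elapsed since the most recent True row, so they do not match the 15/5/15/10 totals requested in the question without further adjustment.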

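The diff column is now timedelta64[ns], so you will want it as integer minutes (FYI, this is a bit clunky because pandas doesn't have a scalar Timedelta type analogous to Timestamp for dates)

(Also, you may have to do a shift() on the rdate series before you finish; I think I'm off by one somewhere) ... but this is the idea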

In [175]: df['diff'].map(lambda x: x.item().seconds/60) 
Out[175]: 
2013-02-05 19:45:00  0 
2013-02-05 19:50:00  0 
2013-02-05 19:55:00  5 
2013-02-05 20:00:00 10 
2013-02-05 20:05:00  0 
2013-02-05 20:10:00  0 
2013-02-05 20:15:00  5 
2013-02-05 20:20:00  0 
2013-02-05 20:25:00  5 
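
In later pandas releases the conversion to minutes can be written without a per-element lambda. A minimal sketch, assuming a pandas version that provides the `.dt` accessor for timedelta Series (this goes beyond the 0.11-dev code shown above); the `diff` Series is reconstructed by hand from the output:

```python
import pandas as pd

# Timedelta values copied from the diff column above, in minutes.
diff = pd.Series(pd.to_timedelta([0, 0, 5, 10, 0, 0, 5, 0, 5], unit="min"))

# total_seconds() is vectorised; floor-divide by 60 and cast to get whole minutes.
minutes = (diff.dt.total_seconds() // 60).astype(int)

print(minutes)
```

Unlike `.seconds`, `total_seconds()` also includes the days component, which matters if a stay can last longer than 24 hours.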

You could also do 'ffill(inplace=True)' to avoid making a temporary array copy. –