2017-05-04 40 views

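Pandas: joining two dataframes on start and end times that are not equal

I have two dataframes, each holding information about events with a start and an end time. The problem is that the two dataframes have different start and end times, because they measure different things. What I want to do is create new events that contain the information from both. These must be split at every boundary that occurs in either dataframe. For example: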

Dataframe A:

Start    End 
2016-12-30 18:51:00 2016-12-30 19:37:00 
2016-12-30 20:03:00 2016-12-30 20:11:00 
2016-12-30 20:12:00 2016-12-30 21:02:00 
2016-12-30 21:02:00 2016-12-30 21:04:00 
2016-12-30 21:10:00 2016-12-30 21:12:00 
2016-12-30 21:12:00 2016-12-30 21:32:00 

Dataframe B:

Start    End 
2016-12-30 18:33:45 2016-12-30 19:18:00 
2016-12-30 19:18:00 2016-12-30 19:38:00 
2016-12-30 19:38:00 2016-12-30 19:48:00 
2016-12-30 19:48:00 2016-12-30 20:15:45 
2016-12-30 20:15:45 2016-12-30 20:35:45 
2016-12-30 20:35:45 2016-12-30 20:45:45 
2016-12-30 20:45:45 2016-12-30 21:14:30 
2016-12-30 21:14:30 2016-12-30 21:35:00 

The ideal output for these would be:

Start    End 
2016-12-30 18:51:00 2016-12-30 19:18:00 
2016-12-30 19:18:00 2016-12-30 19:37:00 
2016-12-30 20:03:00 2016-12-30 20:11:00 
2016-12-30 20:12:00 2016-12-30 20:15:45 
2016-12-30 20:15:45 2016-12-30 20:35:45 
2016-12-30 20:35:45 2016-12-30 20:45:45 
2016-12-30 20:45:45 2016-12-30 21:12:00 
2016-12-30 21:12:00 2016-12-30 21:14:30 
2016-12-30 21:14:30 2016-12-30 21:32:00 

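There are a couple of ways I know of to do this. I could break the dataframes down to the minute level and merge on the minutes, but each dataframe has 2 million rows, so that would be a very long process.

I also have SQL that can do this, but when I tried to run it, it took so long that the DBA killed the process.

The SQL is: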

select
    a.UNIQUE_ID,
    a,
    b,
    c,
    d,
    -- keep the later of the two start times
    CASE WHEN B.START < A.START THEN A.START
         ELSE B.START END AS START,
    -- keep the earlier of the two end times
    CASE WHEN B.END > A.END THEN A.END
         ELSE B.END END AS END
from
    (select
        UNIQUE_ID,
        START,
        END,
        a,
        b
     from table_1
    ) a
join
    (select
        UNIQUE_ID,
        START,
        END,
        c,
        d
     from table_2
    ) b
on 1=1
AND A.UNIQUE_ID = B.UNIQUE_ID
AND ((b.START between a.START and a.END)
  or (b.end between a.START and a.END)
  or (b.START < a.START and b.end > a.end)
  or (a.START < b.START and a.end > b.end)
)

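This produces one row for every pair of start/end combinations that share at least one minute for the same UNIQUE_ID. It then uses the CASE statements to trim each row down to the shared minutes.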
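For reference, the join condition and the two CASE expressions boil down to a small overlap test followed by a clamp. The sketch below is only an illustration in plain Python; the helper name `clamp_overlap` is hypothetical, and it assumes every interval has start <= end. Under that assumption the four OR-ed conditions are equivalent to `a_start <= b_end and b_start <= a_end`.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def clamp_overlap(a_start, a_end, b_start, b_end):
    """Return the shared (start, end) of two closed intervals, or None if disjoint.

    Hypothetical helper: mirrors the SQL join condition plus the CASE clamps,
    assuming start <= end for both intervals.
    """
    if a_start <= b_end and b_start <= a_end:
        # Later of the two starts, earlier of the two ends
        return max(a_start, b_start), min(a_end, b_end)
    return None


# First row of dataframe A and first row of dataframe B from the question
a = (datetime.strptime("2016-12-30 18:51:00", FMT),
     datetime.strptime("2016-12-30 19:37:00", FMT))
b = (datetime.strptime("2016-12-30 18:33:45", FMT),
     datetime.strptime("2016-12-30 19:18:00", FMT))

shared = clamp_overlap(*a, *b)
print(shared)
```

Note that with inclusive comparisons, two intervals that merely touch at one instant still match and yield a zero-length row, so a strict comparison may be preferable depending on what counts as shared time.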

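I can't think of an efficient way to replicate this SQL in Python using pandas. The only merge functions I know of in pandas need equal columns to merge on; they can't be ranges like the join I'm using.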

Is there some kind of merge in pandas that I could use to do something like:

AND ((b.START between a.START and a.END) 
or (b.end between a.START and a.END) 
or (b.START < a.START and b.end > a.end) 
or (a.START < b.START and a.end > b.end) 
) 

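The only approach I can think of is to loop over each row in df a, slice df b down to only the rows that share minutes with that row, merge the two slices, and then concatenate all of those merges into a new df, but that would take a very long time.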
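To make that row-by-row idea concrete, here is a rough sketch of what it might look like. The function name `naive_overlap_join` is made up for illustration, the column names `Start` and `End` are taken from the example tables, and overlap is tested with strict comparisons so intervals that only touch are skipped. It does one scan of df b per row of df a, which is exactly why it would be slow at 2 million rows.

```python
import pandas as pd


def naive_overlap_join(df_a, df_b):
    """Hypothetical helper: for each row of df_a, slice df_b to overlapping rows."""
    pieces = []
    for row in df_a.itertuples(index=False):
        # Rows of df_b that share some time with this row of df_a
        hits = df_b[(df_b["Start"] < row.End) & (df_b["End"] > row.Start)].copy()
        # Trim each hit down to the shared window
        hits["Start"] = hits["Start"].where(hits["Start"] > row.Start, row.Start)
        hits["End"] = hits["End"].where(hits["End"] < row.End, row.End)
        pieces.append(hits)
    return pd.concat(pieces, ignore_index=True)


# First two rows of dataframe A and first four rows of dataframe B
df_a = pd.DataFrame({
    "Start": pd.to_datetime(["2016-12-30 18:51:00", "2016-12-30 20:03:00"]),
    "End": pd.to_datetime(["2016-12-30 19:37:00", "2016-12-30 20:11:00"]),
})
df_b = pd.DataFrame({
    "Start": pd.to_datetime(["2016-12-30 18:33:45", "2016-12-30 19:18:00",
                             "2016-12-30 19:38:00", "2016-12-30 19:48:00"]),
    "End": pd.to_datetime(["2016-12-30 19:18:00", "2016-12-30 19:38:00",
                           "2016-12-30 19:48:00", "2016-12-30 20:15:45"]),
})

result = naive_overlap_join(df_a, df_b)
print(result)
```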

Any help is appreciated.


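So I found a workaround that seems to work, but I'd still like to hear any answers on how to do this in pandas itself. What I'm doing is using the package pandasql to create a sqlite database from the DFs and executing the SQL I know works. It's a pretty nifty package. – user6745154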
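A rough sketch of the same idea, for anyone who wants to try it without installing anything: this uses Python's built-in `sqlite3` module directly rather than pandasql, so it is not the commenter's exact setup. The table names `table_a` and `table_b` and the column names `start_ts` and `end_ts` are assumptions made for the example, and timestamps are stored as ISO-formatted text, which sorts correctly as strings.

```python
import sqlite3

# First two rows of dataframe A and first four rows of dataframe B
rows_a = [
    ("2016-12-30 18:51:00", "2016-12-30 19:37:00"),
    ("2016-12-30 20:03:00", "2016-12-30 20:11:00"),
]
rows_b = [
    ("2016-12-30 18:33:45", "2016-12-30 19:18:00"),
    ("2016-12-30 19:18:00", "2016-12-30 19:38:00"),
    ("2016-12-30 19:38:00", "2016-12-30 19:48:00"),
    ("2016-12-30 19:48:00", "2016-12-30 20:15:45"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (start_ts TEXT, end_ts TEXT)")
conn.execute("CREATE TABLE table_b (start_ts TEXT, end_ts TEXT)")
conn.executemany("INSERT INTO table_a VALUES (?, ?)", rows_a)
conn.executemany("INSERT INTO table_b VALUES (?, ?)", rows_b)

# Two-argument MAX/MIN are SQLite scalar functions, standing in for the CASE clamps
query = """
SELECT MAX(a.start_ts, b.start_ts) AS overlap_start,
       MIN(a.end_ts, b.end_ts)     AS overlap_end
FROM table_a AS a
JOIN table_b AS b
  ON b.start_ts < a.end_ts
 AND a.start_ts < b.end_ts
ORDER BY overlap_start
"""
result = conn.execute(query).fetchall()
conn.close()

for start, end in result:
    print(start, end)
```

The join here uses strict comparisons, so intervals that only touch at a boundary are not paired.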

Answer


I'm going to use an implementation from my question, which asked something similar to what you've written:

import pandas as pd 

df_a = pd.DataFrame({'Start': ['2016-12-30 18:51:00',
                               '2016-12-30 20:03:00',
                               '2016-12-30 20:12:00',
                               '2016-12-30 21:02:00',
                               '2016-12-30 21:10:00',
                               '2016-12-30 21:12:00'],
                     'End': ['2016-12-30 19:37:00',
                             '2016-12-30 20:11:00',
                             '2016-12-30 21:02:00',
                             '2016-12-30 21:04:00',
                             '2016-12-30 21:12:00',
                             '2016-12-30 21:32:00']})
df_b = pd.DataFrame({'Start': ['2016-12-30 18:33:45',
                               '2016-12-30 19:18:00',
                               '2016-12-30 19:38:00',
                               '2016-12-30 19:48:00',
                               '2016-12-30 20:15:45',
                               '2016-12-30 20:35:45',
                               '2016-12-30 20:45:45',
                               '2016-12-30 21:14:30'],
                     'End': ['2016-12-30 19:18:00',
                             '2016-12-30 19:38:00',
                             '2016-12-30 19:48:00',
                             '2016-12-30 20:15:45',
                             '2016-12-30 20:35:45',
                             '2016-12-30 20:45:45',
                             '2016-12-30 21:14:30',
                             '2016-12-30 21:35:00']})

# Convert the strings to datetime 
df_a['Start'] = pd.to_datetime(df_a['Start'], format='%Y-%m-%d %H:%M:%S') 
df_a['End'] = pd.to_datetime(df_a['End'], format='%Y-%m-%d %H:%M:%S') 
df_b['Start'] = pd.to_datetime(df_b['Start'], format='%Y-%m-%d %H:%M:%S') 
df_b['End'] = pd.to_datetime(df_b['End'], format='%Y-%m-%d %H:%M:%S') 

# Create labels for the two datasets 
# These labels will help determine the overlaps downstream 
df_a['Label'] = 'a' 
df_b['Label'] = 'b' 

# With the labels created, I can concatenate the dataframes now 
df_concat = pd.concat([df_a, df_b]) 
df_concat = df_concat[['Label', 'Start', 'End']] # Ordering the columns 

# Convert the dataframe to a list of tuples 
df_concat_rec = df_concat.to_records(index=False) 

# Here's where I'm using my answer that I had used in the other question 
timelist_new = [] 
for time in df_concat_rec: 
    timelist_new.append((time[0], time[1], 'begin')) 
    timelist_new.append((time[0], time[2], 'end')) 

timelist_new = sorted(timelist_new, key=lambda x: x[1]) 

key = None 
keylist = set() 
aggregator = [] 

# Walk over consecutive pairs of sorted boundary events
for idx in range(len(timelist_new) - 1):
    t1 = timelist_new[idx]
    t2 = timelist_new[idx + 1]
    t1_key = str(t1[0])
    t2_key = str(t2[0])
    t1_dt = t1[1]
    t2_dt = t2[1]
    t1_pointer = t1[2]
    t2_pointer = t2[2]

    if t1_dt == t2_dt:
        keylist.add(t1_key)
        keylist.add(t2_key)
    elif t1_dt < t2_dt:
        if t1_pointer == 'begin':
            keylist.add(t1_key)
        if t1_pointer == 'end':
            keylist.discard(t1_key)

    # Record which labels are active between the two boundaries
    key = ','.join(sorted(keylist))
    aggregator.append((key, t1_dt, t2_dt))

# This is where I filter out any records where there isn't an overlap and where the start and end dates are equal
# Note: len(x[0]) > 1 only detects the combined key 'a,b' because the labels are single characters
filtered = [x for x in aggregator if len(x[0]) > 1 and x[1] != x[2]]

# Convert the list of tuples back to dataframe 
final_df = pd.DataFrame.from_records(filtered, columns=['Label', 'Start', 'End']) 

# Print final dataframe 
print(final_df) 

Output:

   Label               Start                 End
0    a,b 2016-12-30 18:51:00 2016-12-30 19:18:00
1    a,b 2016-12-30 19:18:00 2016-12-30 19:37:00
2    a,b 2016-12-30 20:03:00 2016-12-30 20:11:00
3    a,b 2016-12-30 20:12:00 2016-12-30 20:15:45
4    a,b 2016-12-30 20:15:45 2016-12-30 20:35:45
5    a,b 2016-12-30 20:35:45 2016-12-30 20:45:45
6    a,b 2016-12-30 20:45:45 2016-12-30 21:02:00
7    a,b 2016-12-30 21:02:00 2016-12-30 21:04:00
8    a,b 2016-12-30 21:10:00 2016-12-30 21:12:00
9    a,b 2016-12-30 21:12:00 2016-12-30 21:14:30
10   a,b 2016-12-30 21:14:30 2016-12-30 21:32:00
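For comparison, the same "split at every boundary, keep what both sides cover" idea can be written as a counting sweep. This is a separate sketch and not the code above: the function name `split_overlaps` is hypothetical, it works on plain lists of `(start, end)` tuples instead of dataframes, and it assumes start <= end for every interval. It sorts the boundary events once, so the cost is dominated by the sort rather than by comparing every pair of rows.

```python
from datetime import datetime
from itertools import groupby

FMT = "%Y-%m-%d %H:%M:%S"


def parse(ts):
    return datetime.strptime(ts, FMT)


def split_overlaps(intervals_a, intervals_b):
    """Hypothetical helper: segments covered by both inputs, split at every boundary."""
    events = []
    for label, intervals in (("a", intervals_a), ("b", intervals_b)):
        for start, end in intervals:
            events.append((start, label, +1))  # interval opens
            events.append((end, label, -1))    # interval closes
    events.sort(key=lambda e: e[0])

    active = {"a": 0, "b": 0}  # how many intervals of each source are open
    segments = []
    prev_time = None
    for time, group in groupby(events, key=lambda e: e[0]):
        # The counts describe the stretch from prev_time up to this boundary
        if prev_time is not None and active["a"] > 0 and active["b"] > 0:
            segments.append((prev_time, time))
        for _, label, delta in group:
            active[label] += delta
        prev_time = time
    return segments


# First two rows of dataframe A and first four rows of dataframe B
intervals_a = [(parse("2016-12-30 18:51:00"), parse("2016-12-30 19:37:00")),
               (parse("2016-12-30 20:03:00"), parse("2016-12-30 20:11:00"))]
intervals_b = [(parse("2016-12-30 18:33:45"), parse("2016-12-30 19:18:00")),
               (parse("2016-12-30 19:18:00"), parse("2016-12-30 19:38:00")),
               (parse("2016-12-30 19:38:00"), parse("2016-12-30 19:48:00")),
               (parse("2016-12-30 19:48:00"), parse("2016-12-30 20:15:45"))]

segments = split_overlaps(intervals_a, intervals_b)
for start, end in segments:
    print(start, end)
```

Carrying extra columns through would mean tracking which source rows are open at each boundary instead of just counting them.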