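I need help transforming my data so that I can read through transaction data. Create groups/classes based on conditions within columns
Business Case
I want to group certain related transactions together to create groups or classes of events. This data set represents workers going out on various leave-of-absence events. I want to create a class of leaves made up of any transactions that fall within 365 days of the previous transaction in that leave event class. For charting trends, I want to number the classes so that I get a sequence/pattern.
My code lets me see when the very first event occurred, and it can identify when a new class starts, but it does not bucket each transaction into a class.
Requirements:
- Tag every row according to the leave class it falls into.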
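- Number each Unique Leave Event. In this example, index 0 would be Unique Leave Event 2, index 1 would be Unique Leave Event 2, index 2 would be Unique Leave Event 2, index 3 would be Unique Leave Event 1, and so on.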
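I added the desired output as a column labeled "Desired Output". Note that there can be many more rows/events per person, and there can be many more people.
Some Data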
import pandas as pd
data = {'Employee ID': ["100", "100", "100","100","200","200","200","300"],
'Effective Date': ["2016-01-01","2015-06-05","2014-07-01","2013-01-01","2016-01-01","2015-01-01","2013-01-01","2014-01"],
'Desired Output': ["Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 1"]}
df = pd.DataFrame(data, columns=['Employee ID','Effective Date','Desired Output'])
Some code I have tried
# Parse the dates, then pull the next row's ID and date alongside each row
df['Effective Date'] = df['Effective Date'].astype('datetime64[ns]')
df['EmplidShift'] = df['Employee ID'].shift(-1)
df['Effdt-Shift'] = df['Effective Date'].shift(-1)
# Whole-day difference to the next row (negative because dates descend)
df['Effdt Diff'] = (df['Effdt-Shift'] - df['Effective Date']).dt.days
# Flag the last row listed for each employee, i.e. their earliest event
df['Cumul. Count'] = df.groupby('Employee ID').cumcount()
df['Groupby'] = df.groupby('Employee ID')['Cumul. Count'].transform('max')
df['First Row Appears?'] = ""
df.loc[df['Cumul. Count'] == df['Groupby'], 'First Row Appears?'] = "First Row"
# Flag rows whose next row belongs to the same employee
df['Prior Row in Same Emplid Class'] = "No"
df.loc[df['Employee ID'] == df['EmplidShift'], 'Prior Row in Same Emplid Class'] = "Yes"
# Flag rows more than 365 days after the same employee's previous event
df['Effdt > 1 Yr?'] = ""
df.loc[(df['Prior Row in Same Emplid Class'] == "Yes") & (df['Effdt Diff'] < -365), 'Effdt > 1 Yr?'] = "Yes"
# A new leave event starts at the earliest row or after a gap over a year
df['Unique Leave Event'] = ""
df.loc[(df['Effdt > 1 Yr?'] == "Yes") | (df['First Row Appears?'] == "First Row"), 'Unique Leave Event'] = "Unique Leave Event"
df
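For reference, here is a minimal sketch of what the missing bucketing step might look like. It rests on assumptions that are not confirmed above: a new leave event is taken to start whenever the gap to the same employee's previous transaction exceeds 365 days (which reproduces the "Desired Output" column for this sample), and the last date is written as "2014-01-01" purely so that all dates share one format. The names `ordered`, `gap`, `new_event` and `event_no` are illustrative only.

```python
import pandas as pd

# Sample data from the question; the last date is assumed to be 2014-01-01
data = {
    "Employee ID": ["100", "100", "100", "100", "200", "200", "200", "300"],
    "Effective Date": ["2016-01-01", "2015-06-05", "2014-07-01", "2013-01-01",
                       "2016-01-01", "2015-01-01", "2013-01-01", "2014-01-01"],
}
df = pd.DataFrame(data)
df["Effective Date"] = pd.to_datetime(df["Effective Date"])

# Work in chronological order within each employee
ordered = df.sort_values(["Employee ID", "Effective Date"])

# Gap to the same employee's previous transaction (NaT for their first row)
gap = ordered.groupby("Employee ID")["Effective Date"].diff()

# Assumed rule: first row, or a gap of more than 365 days, opens a new event
new_event = gap.isna() | (gap > pd.Timedelta(days=365))

# A running count of "new event" flags per employee gives the event number
event_no = new_event.astype(int).groupby(ordered["Employee ID"]).cumsum()

# Assignment aligns on the original index, so the row order of df is kept
df["Unique Leave Event"] = "Unique Leave Event " + event_no.astype(str)
print(df)
```

Because the event number is a cumulative sum of flags, every row is labeled, not only the row that opens an event. If the intended rule is instead "within 365 days of the first transaction of the event", the gap logic would need to change.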
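This is an elegant solution. The only danger might lie in the "merge" if the OP is working with really huge data frames, but judging from the content of the data that is unlikely. – Khris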