2016-09-26 47 views
7

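Creating groups/classes based on conditions within columns

I need help transforming my data so that I can read through the transaction data.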

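Business case

I want to group certain related transactions together to create groups or classes of events. This dataset represents employees going on various leave-of-absence events. I want to create a class of leaves based on any transactions that fall within 365 days of the leave event class. To chart trends, I want to number these classes so that I get a sequence/pattern.

My code lets me see when the first event occurred, and it can identify when a new class starts, but it does not assign each transaction to a class.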

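Requirements:

  • Label every row according to the leave class it falls into.
  • Number each unique leave event. In this example, index 0 would be Unique Leave Event 2, index 1 would be Unique Leave Event 2, index 2 would be Unique Leave Event 2, index 3 would be Unique Leave Event 1, and so on.

I added a column with the result I am after, labeled "Desired Output". Note that each person can have many more rows/events, and there can be many more people.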

Some data

import pandas as pd 

data = {'Employee ID': ["100", "100", "100","100","200","200","200","300"], 
     'Effective Date': ["2016-01-01","2015-06-05","2014-07-01","2013-01-01","2016-01-01","2015-01-01","2013-01-01","2014-01"], 
     'Desired Output': ["Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 1"]} 
df = pd.DataFrame(data, columns=['Employee ID','Effective Date','Desired Output']) 

Some code I have tried

df['Effective Date'] = df['Effective Date'].astype('datetime64[ns]') 
df['EmplidShift'] = df['Employee ID'].shift(-1) 
df['Effdt-Shift'] = df['Effective Date'].shift(-1) 
df['Prior Row in Same Emplid Class'] = "No" 
df['Effdt Diff'] = df['Effdt-Shift'] - df['Effective Date'] 
df['Effdt Diff'] = (pd.to_timedelta(df['Effdt Diff'], unit='d') + pd.to_timedelta(1,unit='s')).astype('timedelta64[D]') 
df['Cumul. Count'] = df.groupby('Employee ID').cumcount() 


df['Groupby'] = df.groupby('Employee ID')['Cumul. Count'].transform('max') 
df['First Row Appears?'] = "" 
df['First Row Appears?'][df['Cumul. Count'] == df['Groupby']] = "First Row" 
df['Prior Row in Same Emplid Class'][ df['Employee ID'] == df['EmplidShift']] = "Yes" 

df['Effdt > 1 Yr?'] = ""           
df['Effdt > 1 Yr?'][ ((df['Prior Row in Same Emplid Class'] == "Yes") & (df['Effdt Diff'] < -365)) ] = "Yes" 

df['Unique Leave Event'] = "" 
df['Unique Leave Event'][ (df['Effdt > 1 Yr?'] == "Yes") | (df['First Row Appears?'] == "First Row") ] = "Unique Leave Event" 

df 

Answers

2

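You can do this without looping or iterating over your dataframe. According to Wes McKinney, you can use .apply() with a groupby object and define a function that is applied to each group. If you combine this with .shift(), you can get your result without using any loops.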
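
To make the role of .shift() concrete, here is a minimal sketch on hypothetical toy data (the `toy` frame and the `Gap` column are illustrative names, not part of the question). It assumes the dates are already sorted within each employee and only shows how each row gets to see the previous date of the same employee.

```python
import pandas as pd

# Hypothetical toy data: two employees, dates already sorted within each ID.
toy = pd.DataFrame({
    "Employee ID": ["100", "100", "100", "200", "200"],
    "Effective Date": pd.to_datetime(
        ["2013-01-01", "2014-07-01", "2015-06-05", "2013-01-01", "2015-01-01"]
    ),
})

# shift(1) within each group moves every date down one row, so each row
# sees the previous date of the same employee (NaT for the first row).
previous = toy.groupby("Employee ID")["Effective Date"].shift(1)

# Subtracting gives the gap to the previous event; NaT stays NaT.
toy["Gap"] = toy["Effective Date"] - previous
gap_days = toy["Gap"].dt.days.tolist()
print(toy)
```

Because the shift happens per group, the first row of employee 200 is not compared against the last row of employee 100.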

Terse example:

# Group by Employee ID 
grouped = df.groupby("Employee ID") 
# Define function 
def get_unique_events(group): 
    # Convert to date and sort by date, like @Khris did 
    group["Effective Date"] = pd.to_datetime(group["Effective Date"]) 
    group = group.sort_values("Effective Date") 
    event_series = (group["Effective Date"] - group["Effective Date"].shift(1) > pd.Timedelta('365 days')).apply(lambda x: int(x)).cumsum()+1 
    return event_series 

event_df = pd.DataFrame(grouped.apply(get_unique_events).rename("Unique Event")).reset_index(level=0) 
df = pd.merge(df, event_df[['Unique Event']], left_index=True, right_index=True) 
df['Output'] = df['Unique Event'].apply(lambda x: "Unique Leave Event " + str(x)) 
df['Match'] = df['Desired Output'] == df['Output'] 

print(df) 

Output:

  Employee ID Effective Date        Desired Output  Unique Event  \
3         100     2013-01-01  Unique Leave Event 1             1
2         100     2014-07-01  Unique Leave Event 2             2
1         100     2015-06-05  Unique Leave Event 2             2
0         100     2016-01-01  Unique Leave Event 2             2
6         200     2013-01-01  Unique Leave Event 1             1
5         200     2015-01-01  Unique Leave Event 2             2
4         200     2016-01-01  Unique Leave Event 2             2
7         300        2014-01  Unique Leave Event 1             1

                 Output  Match
3  Unique Leave Event 1   True
2  Unique Leave Event 2   True
1  Unique Leave Event 2   True
0  Unique Leave Event 2   True
6  Unique Leave Event 1   True
5  Unique Leave Event 2   True
4  Unique Leave Event 2   True
7  Unique Leave Event 1   True

A more verbose example, for clarity:

import pandas as pd 

data = {'Employee ID': ["100", "100", "100","100","200","200","200","300"], 
     'Effective Date': ["2016-01-01","2015-06-05","2014-07-01","2013-01-01","2016-01-01","2015-01-01","2013-01-01","2014-01"], 
     'Desired Output': ["Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 1"]} 
df = pd.DataFrame(data, columns=['Employee ID','Effective Date','Desired Output']) 

# Group by Employee ID 
grouped = df.groupby("Employee ID") 

# Define a function to get the unique events 
def get_unique_events(group): 
    # Convert to date and sort by date, like @Khris did 
    group["Effective Date"] = pd.to_datetime(group["Effective Date"]) 
    group = group.sort_values("Effective Date") 
    # Define a series of booleans to determine whether the time between dates is over 365 days 
    # Use .shift(1) to look back one row 
    is_year = group["Effective Date"] - group["Effective Date"].shift(1) > pd.Timedelta('365 days') 
    # Convert booleans to integers (0 for False, 1 for True) 
    is_year_int = is_year.apply(lambda x: int(x))  
    # Use the cumulative sum function in pandas to get the cumulative adjustment from the first date. 
    # Add one to start the first event as 1 instead of 0 
    event_series = is_year_int.cumsum() + 1 
    return event_series 

# Run function on df and put results into a new dataframe 
# Convert Employee ID back from an index to a column with .reset_index(level=0) 
event_df = pd.DataFrame(grouped.apply(get_unique_events).rename("Unique Event")).reset_index(level=0) 

# Merge the dataframes 
df = pd.merge(df, event_df[['Unique Event']], left_index=True, right_index=True) 

# Add string to match desired format 
df['Output'] = df['Unique Event'].apply(lambda x: "Unique Leave Event " + str(x)) 

# Check to see if output matches desired output 
df['Match'] = df['Desired Output'] == df['Output'] 

print(df) 

You get the same output:

  Employee ID Effective Date        Desired Output  Unique Event  \
3         100     2013-01-01  Unique Leave Event 1             1
2         100     2014-07-01  Unique Leave Event 2             2
1         100     2015-06-05  Unique Leave Event 2             2
0         100     2016-01-01  Unique Leave Event 2             2
6         200     2013-01-01  Unique Leave Event 1             1
5         200     2015-01-01  Unique Leave Event 2             2
4         200     2016-01-01  Unique Leave Event 2             2
7         300        2014-01  Unique Leave Event 1             1

                 Output  Match
3  Unique Leave Event 1   True
2  Unique Leave Event 2   True
1  Unique Leave Event 2   True
0  Unique Leave Event 2   True
6  Unique Leave Event 1   True
5  Unique Leave Event 2   True
4  Unique Leave Event 2   True
7  Unique Leave Event 1   True
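
The boolean-to-integer-to-cumulative-sum step is easier to follow on a single employee. The sketch below uses employee 100's dates from the example; the variable names are illustrative, and .diff() is used as shorthand for subtracting .shift(1).

```python
import pandas as pd

# Employee 100's dates from the example, sorted ascending.
dates = pd.Series(pd.to_datetime(
    ["2013-01-01", "2014-07-01", "2015-06-05", "2016-01-01"]
))

# True where the gap to the previous date exceeds 365 days.
# The first row compares against NaT, which evaluates to False.
is_new_event = dates.diff() > pd.Timedelta("365 days")

# Each True bumps the running total by one; +1 makes numbering start at 1.
event_number = is_new_event.astype(int).cumsum() + 1

print(is_new_event.tolist())   # [False, True, False, False]
print(event_number.tolist())   # [1, 2, 2, 2]
```

Only the jump from 2013-01-01 to 2014-07-01 is longer than 365 days, so the counter increases once and the three later rows share event number 2.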

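This is an elegant solution. If the OP works with really huge dataframes, the only danger might lie in the merge, but judging by the contents of the data that seems unlikely. – Khris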
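
If the merge ever did become a concern, one possible way around it is to let pandas align the result on the index instead. This is only a sketch under some assumptions, not the answerer's method: it uses the "2014-01-01" spelling of the last date, and `ordered`, `gap`, `starts_new` and `event_no` are illustrative names.

```python
import pandas as pd

data = {
    "Employee ID": ["100", "100", "100", "100", "200", "200", "200", "300"],
    "Effective Date": ["2016-01-01", "2015-06-05", "2014-07-01", "2013-01-01",
                       "2016-01-01", "2015-01-01", "2013-01-01", "2014-01-01"],
}
df = pd.DataFrame(data)
df["Effective Date"] = pd.to_datetime(df["Effective Date"])

# Sort so that each employee's dates ascend; the original index is kept.
ordered = df.sort_values(["Employee ID", "Effective Date"])

# Gap to the previous event of the same employee (NaT on the first row).
gap = ordered.groupby("Employee ID")["Effective Date"].diff()

# Flag rows that start a new event, then count the flags per employee.
starts_new = (gap > pd.Timedelta("365 days")).astype(int)
event_no = starts_new.groupby(ordered["Employee ID"]).cumsum() + 1

# Assignment aligns on the index, so no merge is needed.
df["Unique Event"] = event_no
df["Output"] = "Unique Leave Event " + df["Unique Event"].astype(str)
print(df)
```

Because the sorted frame keeps the original index labels, the computed numbers land on the right rows even though `df` itself is never reordered.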

3

This is a bit clunky, but it produces the right output, at least for your small example:

import pandas as pd 

data = {'Employee ID': ["100", "100", "100","100","200","200","200","300"], 
     'Effective Date': ["2016-01-01","2015-06-05","2014-07-01","2013-01-01","2016-01-01","2015-01-01","2013-01-01","2014-01-01"], 
     'Desired Output': ["Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 2","Unique Leave Event 2","Unique Leave Event 1","Unique Leave Event 1"]} 
df = pd.DataFrame(data, columns=['Employee ID','Effective Date','Desired Output']) 

df["Effective Date"] = pd.to_datetime(df["Effective Date"]) 
df = df.sort_values(["Employee ID","Effective Date"]).reset_index(drop=True) 

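# The first row always starts the first event.
# .loc is used throughout because .ix was removed in pandas 1.0.
df.loc[0,"Result"] = "Unique Leave Event 1"

for i,_ in df.iterrows():
    # Guard: row i+1 does not exist on the last iteration
    if i < len(df)-1:
        if df.loc[i+1,"Employee ID"] == df.loc[i,"Employee ID"]:
            # Same employee: start a new event only if the gap exceeds 365 days
            if df.loc[i+1,"Effective Date"] - df.loc[i,"Effective Date"] > pd.Timedelta('365 days'):
                df.loc[i+1,"Result"] = "Unique Leave Event " + str(int(df.loc[i,"Result"].split()[-1])+1)
            else:
                df.loc[i+1,"Result"] = df.loc[i,"Result"]
        else:
            # New employee: restart the numbering
            df.loc[i+1,"Result"] = "Unique Leave Event 1"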

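Note: this code assumes that the first row always contains the string Unique Leave Event 1.

EDIT: Some explanation.

First, I convert the dates to datetime format and then re-sort the dataframe so that the dates are ascending for each Employee ID.

Then I iterate over the rows of the frame using the built-in iterator iterrows. The _ in for i,_ is just a placeholder for the second variable, which I don't use: the iterator returns both the row number and the row, and I only need the number here.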

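Inside the iterator I do a row-wise comparison, so I fill the first row manually by default and then assign to row i+1. I do it this way because I know the value of the first row but not the value of the last row. I then compare row i+1 with row i inside an if guard, because i+1 would raise an index error on the last iteration.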

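In the loop, I first check whether the Employee ID has changed between the two rows. If it has not, I compare the dates of the two rows to see whether they are more than 365 days apart. If they are, I read the string "Unique Leave Event X" from row i, increase the number by one, and write it to row i+1. If the dates are closer together, I just copy the string from the previous row.

If, on the other hand, the Employee ID does change, I just write "Unique Leave Event 1" to start over.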
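
The same decision logic can be written without pandas, which may make the three branches easier to see. This is a sketch only: `label_events` and `max_gap` are hypothetical names, and it assumes the rows are already sorted by employee and then by date. It looks back at the previous row rather than ahead at row i+1, so no guard for the last row is needed.

```python
from datetime import date, timedelta

def label_events(rows, max_gap=timedelta(days=365)):
    """rows: list of (employee_id, date) tuples sorted by ID, then date."""
    labels = []
    previous = None  # (employee_id, date) of the row before the current one
    counter = 0
    for emp_id, day in rows:
        if previous is None or previous[0] != emp_id:
            counter = 1            # new employee: restart numbering
        elif day - previous[1] > max_gap:
            counter += 1           # same employee, gap over 365 days
        # otherwise: same employee, close dates, keep the counter
        labels.append(f"Unique Leave Event {counter}")
        previous = (emp_id, day)
    return labels

rows = [
    ("100", date(2013, 1, 1)), ("100", date(2014, 7, 1)),
    ("100", date(2015, 6, 5)), ("100", date(2016, 1, 1)),
    ("200", date(2013, 1, 1)), ("200", date(2015, 1, 1)),
    ("200", date(2016, 1, 1)), ("300", date(2014, 1, 1)),
]
labels = label_events(rows)
print(labels)
```

Note that 2015-01-01 to 2016-01-01 is exactly 365 days, which is not more than 365, so employee 200's last row stays in event 2.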

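Note 1: iterrows() has no options to set, so I can't iterate over just a subset.
Note 2: Always iterate using one of the built-in iterators, and only iterate when you can't solve the problem any other way.
Note 3: When assigning values inside an iteration, always use ix, loc, or iloc. (.ix has since been removed from pandas, so use loc or iloc on current versions.)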
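
A small illustration of why Note 3 matters, on a hypothetical two-row frame with mixed column types (the names `frame`, `unchanged` and `updated` are illustrative): writing to the row object that iterrows() yields does not change the dataframe, while writing through .loc does.

```python
import pandas as pd

frame = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

# Writing to the row object yielded by iterrows() changes only that
# temporary Series, not the dataframe itself.
for i, row in frame.iterrows():
    row["value"] = 100
unchanged = frame["value"].tolist()

# Writing through .loc with the row label does update the dataframe.
for i, row in frame.iterrows():
    frame.loc[i, "value"] = row["value"] * 10
updated = frame["value"].tolist()

print(unchanged)  # [1, 2]
print(updated)    # [10, 20]
```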

Thanks! Could you add some comments explaining how you did it? – Christopher

Hi, sorry for the long wait. I only comment here from work, and we had a three-day weekend. I'll add some comments now. – Khris