2016-01-13 41 views
0

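I'm trying to create a cohort analysis that shows how unique purchases develop over time, with one special condition: a cohort group should only contain users who used a discount coupon on their first order.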

This is the code I currently use to build the cohort analysis over the whole dataset:

import numpy as np 
import pandas as pd 

data_set = list(data_set) 
df = pd.DataFrame(data_set) 
df['OrderPeriod'] = df.submitted_at.apply(lambda x: x.strftime('%Y-%m')) 

df.set_index('submitted_by_id', inplace=True) 
df['CohortGroup'] = df.groupby(level=0)['submitted_at'].min().apply(lambda x: x.strftime('%Y, %m')) 
df.reset_index(inplace=True) 

grouped = df.groupby(['CohortGroup', 'OrderPeriod']) 

cohorts = grouped.agg({ 
    'submitted_by_id': pd.Series.nunique, 
    'id': pd.Series.nunique, 
}) 

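cohorts.rename(columns={'id': 'TotalOrdersInPeriod', 'submitted_by_id': 'TotalUsers'}, inplace=True)

# cohort_period is the helper from the blog post credited below:
# it numbers each cohort's order periods 1, 2, 3, ...
def cohort_period(df):
    df['CohortPeriod'] = np.arange(len(df)) + 1
    return df

cohorts = cohorts.groupby(level=0).apply(cohort_period)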
cohorts.reset_index(inplace=True) 
cohorts.set_index(['CohortGroup', 'CohortPeriod'], inplace=True) 

cohort_group_size = cohorts['TotalUsers'].groupby(level=0).first() 
cohorts['TotalOrders'] = cohorts.groupby(level=0).TotalOrdersInPeriod.cumsum() 

total_buys = cohorts['TotalOrders'].unstack(0).divide(cohort_group_size, axis=1) 

This shows my cohorts like this:

CohortGroup   2015, 01  2015, 02
CohortPeriod
1                  1         1
2                  1.5

╔═════╦═════════════════╦══════════════╦═══════════╗
║ id  ║ submitted_by_id ║ submitted_at ║ coupon_id ║
╠═════╬═════════════════╬══════════════╬═══════════╣
║ 1   ║ 1               ║ 2015-01-01   ║           ║
║ 2   ║ 2               ║ 2015-01-02   ║ 1         ║
║ 3   ║ 1               ║ 2015-02-02   ║ 1         ║
║ 4   ║ 3               ║ 2015-02-02   ║           ║
║ ... ║ ...             ║ ...          ║ ...       ║
╚═════╩═════════════════╩══════════════╩═══════════╝

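That table is what my dataset looks like, and the code above builds the cohort analysis over the whole of it.

What I want is to somehow restrict my cohort groups to customers whose first order has a coupon_id.

So the table I would get looks like this:

CohortGroup   2015, 01  2015, 02
CohortPeriod
1                  1       NaN
2                  1

How do I go about this?

Credit to http://www.gregreda.com/2015/08/23/cohort-analysis-with-python/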

Answers

0

Starting with:

   id  submitted_by_id submitted_at  coupon_id
0   1                1   2015-01-01        NaN
1   2                2   2015-01-02          1
2   3                1   2015-02-02          1
3   4                3   2015-02-02        NaN

you can get your cohort groups and periods as follows:

df['order_period'] = pd.to_datetime(df.submitted_at).dt.to_period('M') 
df = df.rename(columns={'submitted_by_id': 'customer_id'}).drop(['id', 'submitted_at'], axis=1) 
# cohort group = the earliest order period of each customer
df['cohort_group'] = df.groupby('customer_id')['order_period'].transform('min')
df['cohort_period'] = df.groupby(['cohort_group', 'customer_id'])['order_period'].rank() 

   customer_id  coupon_id order_period cohort_group  cohort_period
0            1        NaN      2015-01      2015-01              1
1            2          1      2015-01      2015-01              1
2            1          1      2015-02      2015-01              2
3            3        NaN      2015-02      2015-02              1
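The labelling step above can also be sketched as a self-contained script. This is only an illustration under a few assumptions: the `orders` frame is rebuilt by hand from the four sample rows, the cohort group is taken from the earliest timestamp rather than from the period column, and `month_number` is a made-up helper column used to rank a customer's order months.

```python
import pandas as pd

# The four sample orders shown above (rebuilt by hand for this sketch).
orders = pd.DataFrame({
    'customer_id': [1, 2, 1, 3],
    'submitted_at': pd.to_datetime(
        ['2015-01-01', '2015-01-02', '2015-02-02', '2015-02-02']),
    'coupon_id': [None, 1, 1, None],
})

orders['order_period'] = orders['submitted_at'].dt.to_period('M')

# Cohort group: the month of each customer's earliest order.
first_order = orders.groupby('customer_id')['submitted_at'].transform('min')
orders['cohort_group'] = first_order.dt.to_period('M')

# Cohort period: 1 for the customer's first order month, 2 for the next
# month in which they ordered, and so on (dense rank of the order months).
month_number = orders['submitted_at'].dt.year * 12 + orders['submitted_at'].dt.month
orders['cohort_period'] = (
    month_number.groupby(orders['customer_id']).rank(method='dense').astype(int)
)

print(orders[['customer_id', 'order_period', 'cohort_group', 'cohort_period']])
```

Ranking an integer month number instead of the period column keeps the ranking on a plain numeric dtype; the labels themselves are still monthly periods.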

Now you can pick out the users who used a coupon during their first cohort_period (only one in the sample data):

coupon_customers = df.groupby(['cohort_group', 'customer_id']).apply(lambda x: x.sort_values('cohort_period').iloc[0]).dropna(subset=['coupon_id']).customer_id.tolist() 

[2] 

Then set the index so that customer_id is listed as it appears per cohort_group and cohort_period:

df = df.set_index(['cohort_group', 'cohort_period']).loc[:, 'customer_id'].to_frame() 

                            customer_id
cohort_group cohort_period
2015-01      1                        1
             1                        2
             2                        1
2015-02      1                        3

From that you get the cohort count for all customers, coupon users included:

cohort_count = df.groupby(level=['cohort_group', 'cohort_period']).count().unstack('cohort_period') 

cohort_period  1    2
cohort_group
2015-01        2    1
2015-02        1  NaN

or, filtering out coupon_customers, the count without coupon users:

cohort_count_no_coupons = df[~df.isin(coupon_customers)].groupby(level=['cohort_group', 'cohort_period']).count().unstack('cohort_period') 

cohort_period  1    2
cohort_group
2015-01        1    1
2015-02        1  NaN
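As an alternative to the `groupby(...).apply(...)` call, the "coupon on the first order" test can be sketched with `sort_values` and `drop_duplicates`. This is not the code from the answer, just an equivalent idea on the same four sample rows; the names `orders`, `first_orders`, `used_coupon` and `no_coupon` are made up for the illustration.

```python
import pandas as pd

# The four sample orders from the answer (rebuilt by hand for this sketch).
orders = pd.DataFrame({
    'customer_id': [1, 2, 1, 3],
    'submitted_at': pd.to_datetime(
        ['2015-01-01', '2015-01-02', '2015-02-02', '2015-02-02']),
    'coupon_id': [None, 1, 1, None],
})

# Each customer's earliest order: sort by time, keep the first row per customer.
first_orders = orders.sort_values('submitted_at').drop_duplicates('customer_id')

# Customers whose earliest order carries a coupon.
coupon_customers = first_orders.loc[
    first_orders['coupon_id'].notna(), 'customer_id'
].tolist()

# Split the full order history by that membership.
used_coupon = orders[orders['customer_id'].isin(coupon_customers)]
no_coupon = orders[~orders['customer_id'].isin(coupon_customers)]

print(coupon_customers)
```

Customer 1 also used a coupon, but only on the second order, so that customer stays in the `no_coupon` half; only the first order decides membership.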
+0

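Looks very promising. I'm looking forward to trying it when I get to work.

+0

I'm not entirely sure what you are doing in the last two lines: you don't use your coupon_customers, and I can't get the same result as you. However, I have arrived at a solution using your idea. I'll post it after a bit more testing and mark your answer as accepted.

+0

Sorry, I forgot to post the last line, the one that produces the actual result... – Stefan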

0

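Many thanks to Stefan for pointing me in the right direction. This is what I ended up doing. I'll mark Stefan's answer as the accepted answer, since it is what led me to my solution.

I expanded the test dataset a bit, so it now looks like this: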

   coupon_id  final_amount  id         submitted_at  submitted_by_id OrderPeriod
0        NaN           100   1  2015-01-01 14:30:00                1     2015-01
1          1           100   2  2015-01-02 14:31:00                2     2015-01
2          1           100   3  2015-02-02 14:31:00                1     2015-02
3        NaN           100   4  2015-02-02 14:31:00                3     2015-02
4        NaN           100   5  2015-02-02 14:31:00                2     2015-02
5          2           100   6  2015-01-02 14:31:00                4     2015-01
6          2           100   7  2015-02-03 14:31:00                5     2015-02
7        NaN           100   8  2015-01-03 14:31:00                2     2015-01

Here it is as a list of Python dictionaries:

import datetime
from decimal import Decimal

sample_data = [ 
     {'id': 1, 
     'submitted_by_id': 1, 
     'submitted_at': datetime.datetime(2015, 1, 1, 14, 30), 
     'final_amount': Decimal('100'), 
     'coupon_id': None, 
     }, 
     {'id': 2, 
     'submitted_by_id': 2, 
     'submitted_at': datetime.datetime(2015, 1, 2, 14, 31), 
     'final_amount': Decimal('100'), 
     'coupon_id': 1, 
     }, 
     {'id': 3, 
     'submitted_by_id': 1, 
     'submitted_at': datetime.datetime(2015, 2, 2, 14, 31), 
     'final_amount': Decimal('100'), 
     'coupon_id': 1, 
     }, 
     {'id': 4, 
     'submitted_by_id': 3, 
     'submitted_at': datetime.datetime(2015, 2, 2, 14, 31), 
     'final_amount': Decimal('100'), 
     'coupon_id': None, 
     }, 
     {'id': 5, 
     'submitted_by_id': 2, 
     'submitted_at': datetime.datetime(2015, 2, 2, 14, 31), 
     'final_amount': Decimal('100'), 
     'coupon_id': None, 
     }, 
     {'id': 6, 
     'submitted_by_id': 4, 
     'submitted_at': datetime.datetime(2015, 1, 2, 14, 31), 
     'final_amount': Decimal('100'), 
     'coupon_id': 2, 
     }, 
     {'id': 7, 
     'submitted_by_id': 5, 
     'submitted_at': datetime.datetime(2015, 2, 3, 14, 31), 
     'final_amount': Decimal('100'), 
     'coupon_id': 2, 
     }, 
     {'id': 8, 
     'submitted_by_id': 2, 
     'submitted_at': datetime.datetime(2015, 1, 3, 14, 31), 
     'final_amount': Decimal('100'), 
     'coupon_id': None, 
     }, 
    ] 

Here is the solution:

df = pd.DataFrame(sample_data) 
df['OrderPeriod'] = df.submitted_at.dt.to_period('M') 

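# `group` is assumed to be set before this point
# ('used_coupon', 'did_not_use_coupon', or anything else for no filtering)
if group in ['used_coupon', 'did_not_use_coupon']: 
    df2 = df.copy() 

    # cohort group = the earliest order period of each customer
    df2['CohortGroup'] = df2.groupby('submitted_by_id')['OrderPeriod'].transform('min')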
    df2['CohortPeriod'] = df2.groupby(
     ['OrderPeriod', 'submitted_by_id'] 
    )['OrderPeriod'].rank() 

    coupon_customers = df2.groupby(['CohortGroup', 'submitted_by_id']).apply(
      lambda x: x.sort_values('submitted_at').iloc[0] 
    ).dropna(subset=['coupon_id']).submitted_by_id.tolist() 

    # coupon_customers = [2, 4, 5] 

    if group == 'used_coupon': 
     # delete rows in the original dataframe where the customer is not 
     # in the coupon_customers_list 
     df = df[df['submitted_by_id'].isin(coupon_customers)] 
    # group == 'did_not_use_coupon' 
    else: 
     # delete rows in the original dataframe where the customer is 
     # in the coupon_customers_list (note the ~ negation)
     df = df[~df['submitted_by_id'].isin(coupon_customers)] 

# From here it's just the same code as I originally used 
df.set_index('submitted_by_id', inplace=True) 
df['CohortGroup'] = df.groupby(level=0)['submitted_at'].min().apply(lambda x: x.to_period('M')) 

df.reset_index(inplace=True) 
print(df.head())

grouped = df.groupby(['CohortGroup', 'OrderPeriod']) 

cohorts = grouped.agg({ 
    'submitted_by_id': pd.Series.nunique, 
    'id': pd.Series.nunique, 
}) 

cohorts.rename(columns={'id': 'TotalOrdersInPeriod', 'submitted_by_id': 'TotalUsers'}, inplace=True)

# cohort_period is the helper from the blog post credited in the question
cohorts = cohorts.groupby(level=0).apply(cohort_period) 

cohorts.reset_index(inplace=True) 
cohorts.set_index(['CohortGroup', 'CohortPeriod'], inplace=True) 

cohort_group_size = cohorts['TotalUsers'].groupby(level=0).first() 

cohorts['TotalOrders'] = cohorts.groupby(level=0).TotalOrdersInPeriod.cumsum() 

total_buys = cohorts['TotalOrders'].unstack(0).divide(cohort_group_size, axis=1) 

Result for group = 'used_coupon':

CohortPeriod     1     2
CohortGroup
2015-01       1.50  2.00
2015-02       1.00
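For anyone who wants to check those numbers, here is a compact, self-contained sketch of the same pipeline on the expanded test data. It is a variation, not the exact code above: `total_buys_for` is a made-up function name, periods are kept as plain 'YYYY-MM' strings, the first order is found with `drop_duplicates`, and the cohort period is numbered with `cumcount` instead of the `cohort_period` helper.

```python
import pandas as pd

# (id, submitted_by_id, submitted_at, coupon_id) from the expanded test data.
rows = [
    (1, 1, '2015-01-01 14:30', None),
    (2, 2, '2015-01-02 14:31', 1),
    (3, 1, '2015-02-02 14:31', 1),
    (4, 3, '2015-02-02 14:31', None),
    (5, 2, '2015-02-02 14:31', None),
    (6, 4, '2015-01-02 14:31', 2),
    (7, 5, '2015-02-03 14:31', 2),
    (8, 2, '2015-01-03 14:31', None),
]
df = pd.DataFrame(rows, columns=['id', 'submitted_by_id', 'submitted_at', 'coupon_id'])
df['submitted_at'] = pd.to_datetime(df['submitted_at'])


def total_buys_for(orders, group):
    """Cumulative orders per cohort member.

    `group` is 'used_coupon', 'did_not_use_coupon', or anything else
    to keep all customers.
    """
    # Customers whose earliest order carries a coupon.
    first = orders.sort_values('submitted_at').drop_duplicates('submitted_by_id')
    coupon_customers = first.loc[first['coupon_id'].notna(), 'submitted_by_id']
    in_list = orders['submitted_by_id'].isin(coupon_customers)
    if group == 'used_coupon':
        orders = orders[in_list]
    elif group == 'did_not_use_coupon':
        orders = orders[~in_list]  # negated: keep everyone NOT in the list

    orders = orders.copy()
    orders['OrderPeriod'] = orders['submitted_at'].dt.strftime('%Y-%m')
    orders['CohortGroup'] = (orders.groupby('submitted_by_id')['submitted_at']
                                   .transform('min').dt.strftime('%Y-%m'))

    # Unique users and orders per (cohort, order month), sorted chronologically.
    cohorts = (orders.groupby(['CohortGroup', 'OrderPeriod'])
                     .agg(TotalUsers=('submitted_by_id', 'nunique'),
                          TotalOrdersInPeriod=('id', 'nunique'))
                     .reset_index())
    cohorts['CohortPeriod'] = cohorts.groupby('CohortGroup').cumcount() + 1
    cohorts['TotalOrders'] = cohorts.groupby('CohortGroup')['TotalOrdersInPeriod'].cumsum()
    # Cohort size = unique users in the cohort's first period.
    cohorts['CohortSize'] = cohorts.groupby('CohortGroup')['TotalUsers'].transform('first')
    cohorts['TotalBuys'] = cohorts['TotalOrders'] / cohorts['CohortSize']
    return cohorts.pivot(index='CohortGroup', columns='CohortPeriod', values='TotalBuys')


print(total_buys_for(df, 'used_coupon'))
print(total_buys_for(df, 'did_not_use_coupon'))
```

The 2015-01 coupon cohort has two customers (2 and 4) with three orders in January and one more in February, which gives 3/2 = 1.5 and 4/2 = 2.0, matching the table above.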