2017-01-27 113 views
3

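Create all possible permutations of one column within partitions of another column in a pandas DataFrame

I have a DataFrame that looks like this: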

Current State

My goal is to get to:

Final State

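Explanation:

  1. Every customer has placed 3 orders
  2. Any category can be purchased in each order
  3. Desired state: get every permutation of the categories a customer purchased, following the order sequence. The second image helps make this clearer
  4. In the desired state, Category1 is a category purchased in the first order, Category2 is a category purchased in the second order, and so on.

The code I am using: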

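import time

import pandas as pd

# base_df is the input DataFrame with columns CustomerName, order_seq and Category
start_time = time.time()

df = pd.DataFrame()
for CustomerName in base_df.CustomerName.unique():
    df1 = base_df[(base_df['CustomerName']== CustomerName)][['CustomerName','order_seq','Category']]
    # Cartesian product of the categories bought in each order of this customer
    df2 = pd.DataFrame(index=pd.MultiIndex.from_product([subdf['Category'] for p, subdf in df1.groupby(['order_seq'])], names = df1.order_seq.unique())).reset_index()
    df2['CustomerName'] = CustomerName
    df = pd.concat([df, df2])  # DataFrame.append was removed in pandas 2.0

print("--- %s seconds ---" %(time.time() - start_time))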

This takes about 10 minutes to run on my dataset, so I am looking for a faster way.

I am working in pandas for now, but pointers for R or SQL are welcome too! Thanks!

+0

Is this really a permutation? Why can customer 1 only order Food in his first order? –

+0

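Welcome to Stack Overflow! You can [take the tour](http://stackoverflow.com/tour) first, learn [how to ask a good question](http://stackoverflow.com/help/how-to-ask), and create a [Minimal, Complete, and Verifiable](http://stackoverflow.com/help/mcve) example. That makes it easier for us to help you. –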

+0

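@PauloMiraMor - No, it could have been anything. He could have bought Clothes, Furniture, or both first. And yes, I need all combinations of products for each customer, following the order sequence – Tanya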

Answers

0

Okay. It took some work, but I got it. Hope it helps.

import pandas as pd
import numpy as np
from functools import reduce  # reduce is not a builtin in Python 3

df = pd.DataFrame([], columns=['CustomerName','Order Sequence','Category'])

df['CustomerName'] = [1,1,1,1,1,1,1,2,2,2,3,3,3,3]
df['Order Sequence'] = [1,2,2,2,3,3,3,1,2,3,1,1,2,3]
df['Category'] = ['Food','Food','Clothes','Furniture','Clothes','Food','Toys','Clothes','Toys','Food','Furniture','Toys','Food','Food']

df2 = pd.DataFrame([], columns=['CustomerName','Category1','Category2','Category3'])

for CN in sorted(set(df['CustomerName'])):

    df_temp = pd.DataFrame([], columns=['CustomerName','Category1','Category2','Category3'])

    list_OS_1 = []
    list_OS_2 = []
    list_OS_3 = []

    # Product of the number of categories in each order = number of rows for this customer
    MMC = reduce(lambda x, y: x*y, df.loc[df['CustomerName']==CN, 'Order Sequence'].value_counts().values)

    # For each order, repeat its categories until the list holds MMC entries
    for N in np.arange(MMC // len(df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==1)), 'Category'])):
        for CTG in df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==1)), 'Category']:
            list_OS_1.append(CTG)

    for N in np.arange(MMC // len(df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==2)), 'Category'])):
        for CTG in df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==2)), 'Category']:
            list_OS_2.append(CTG)

    for N in np.arange(MMC // len(df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==3)), 'Category'])):
        for CTG in df.loc[((df['CustomerName']==CN) & (df['Order Sequence']==3)), 'Category']:
            list_OS_3.append(CTG)

    df_temp['Category1'] = list_OS_1
    df_temp['Category2'] = list_OS_2
    df_temp['Category3'] = list_OS_3
    df_temp['CustomerName'] = CN

    # Stack this customer's rows under the previous ones
    df2 = pd.concat([df2, df_temp], axis=0)

print(df2)

Output:

   CustomerName  Category1  Category2 Category3
0           1.0       Food       Food   Clothes
1           1.0       Food    Clothes      Food
2           1.0       Food  Furniture      Toys
3           1.0       Food       Food   Clothes
4           1.0       Food    Clothes      Food
5           1.0       Food  Furniture      Toys
6           1.0       Food       Food   Clothes
7           1.0       Food    Clothes      Food
8           1.0       Food  Furniture      Toys
0           2.0    Clothes       Toys      Food
0           3.0  Furniture       Food      Food
1           3.0       Toys       Food      Food

PS: It is not dynamic, so it will break if you add or remove categories. But as long as the data follows the criteria you originally gave me, it should work.
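For comparison, here is a minimal sketch of a variant that does not depend on a fixed set of categories. It uses `itertools.product` on each customer's per-order category lists, which is a different technique from the loops above. It assumes the same sample columns (`CustomerName`, `Order Sequence`, `Category`) and that every customer has the same number of orders, as stated in the question; the names `per_order`, `rows`, and `result` are only illustrative.

```python
from itertools import product

import pandas as pd

# Sample data copied from the answer above.
df = pd.DataFrame({
    'CustomerName': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'Order Sequence': [1, 2, 2, 2, 3, 3, 3, 1, 2, 3, 1, 1, 2, 3],
    'Category': ['Food', 'Food', 'Clothes', 'Furniture', 'Clothes', 'Food', 'Toys',
                 'Clothes', 'Toys', 'Food', 'Furniture', 'Toys', 'Food', 'Food'],
})

rows = []
for customer, cust_df in df.groupby('CustomerName'):
    # One list of categories per order, sorted by order sequence.
    per_order = [grp['Category'].tolist()
                 for _, grp in cust_df.groupby('Order Sequence')]
    # Cartesian product: pick one category from each order.
    for combo in product(*per_order):
        rows.append((customer, *combo))

# Assumption: every customer has the same number of orders.
n_orders = df['Order Sequence'].nunique()
columns = ['CustomerName'] + ['Category%d' % i for i in range(1, n_orders + 1)]
result = pd.DataFrame(rows, columns=columns)
print(result)
```

On the sample data this gives 9 rows for customer 1, 1 row for customer 2, and 2 rows for customer 3. Whether it is faster on a large dataset has not been measured here.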

+0

Thanks! Unfortunately, I may need to add new categories! – Tanya

1

Consider merging three OrderSequence dataframes, each joined to the distinct CustomerName values:

import pandas as pd 

df = pd.DataFrame({'CustomerName': [1,1,1,1,1,1,1,2,2,2,3,3,3,3], 
        'OrderSequence': [1,2,2,2,3,3,3,1,2,3,1,1,2,3], 
        'Category': ['Food','Food','Clothes','Furniture','Clothes','Food','Toys', 
           'Clothes','Toys','Food','Furniture','Toys','Food','Food']}) 

finaldf = pd.DataFrame(df['CustomerName'].drop_duplicates()) 

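for i in range(1,4):
    # Rows of order i only, with the category column renamed to Category<i>
    seqdf = (df[df['OrderSequence']==i][['CustomerName', 'Category']]
             .rename(columns={'Category':'Category'+str(i)}))
    # Joining on CustomerName pairs every earlier combination with every category of order i
    finaldf = pd.merge(finaldf, seqdf, on=['CustomerName'])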

print(finaldf) 

#     CustomerName  Category1  Category2 Category3
# 0              1       Food       Food   Clothes
# 1              1       Food       Food      Food
# 2              1       Food       Food      Toys
# 3              1       Food    Clothes   Clothes
# 4              1       Food    Clothes      Food
# 5              1       Food    Clothes      Toys
# 6              1       Food  Furniture   Clothes
# 7              1       Food  Furniture      Food
# 8              1       Food  Furniture      Toys
# 9              2    Clothes       Toys      Food
# 10             3  Furniture       Food      Food
# 11             3       Toys       Food      Food
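If the number of orders per customer is not always three, the same merge loop can iterate over whatever order numbers are present instead of `range(1,4)`. The following is only a sketch under that assumption, reusing the sample data above. Note that the inner merge drops any customer who is missing one of the order numbers.

```python
import pandas as pd

# Sample data copied from the answer above.
df = pd.DataFrame({
    'CustomerName': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'OrderSequence': [1, 2, 2, 2, 3, 3, 3, 1, 2, 3, 1, 1, 2, 3],
    'Category': ['Food', 'Food', 'Clothes', 'Furniture', 'Clothes', 'Food', 'Toys',
                 'Clothes', 'Toys', 'Food', 'Furniture', 'Toys', 'Food', 'Food'],
})

# Start from one row per customer.
finaldf = pd.DataFrame(df['CustomerName'].drop_duplicates())

# Loop over the order numbers found in the data rather than a fixed range.
for seq in sorted(df['OrderSequence'].unique()):
    seqdf = (df.loc[df['OrderSequence'] == seq, ['CustomerName', 'Category']]
               .rename(columns={'Category': 'Category' + str(seq)}))
    # Each merge multiplies a customer's rows by the number of categories in this order.
    finaldf = pd.merge(finaldf, seqdf, on='CustomerName')

print(finaldf)
```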

Admittedly, the setup above was first worked out in SQL using self-joins and then translated to pandas:

SELECT t1.CustomerName, t2.Category AS Category1, 
     t3.Category AS Category2, t4.Category AS Category3 

FROM (SELECT DISTINCT CustomerName FROM DataFrame) AS t1 
INNER JOIN DataFrame AS t2 
ON t1.CustomerName = t2.CustomerName 
INNER JOIN DataFrame AS t3 
ON t1.CustomerName = t3.CustomerName 
INNER JOIN DataFrame AS t4 
ON t1.CustomerName = t4.CustomerName 

WHERE (t2.OrderSequence=1) AND (t3.OrderSequence=2) AND (t4.OrderSequence=3); 
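To try the query without a database server, one option is to load the sample data into an in-memory SQLite database from Python. This is a sketch, not part of the original answer: it assumes the table is named `DataFrame` as in the query and that SQLite is an acceptable stand-in for whichever SQL engine is actually used.

```python
import sqlite3

import pandas as pd

# Sample data copied from the pandas example above.
df = pd.DataFrame({
    'CustomerName': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'OrderSequence': [1, 2, 2, 2, 3, 3, 3, 1, 2, 3, 1, 1, 2, 3],
    'Category': ['Food', 'Food', 'Clothes', 'Furniture', 'Clothes', 'Food', 'Toys',
                 'Clothes', 'Toys', 'Food', 'Furniture', 'Toys', 'Food', 'Food'],
})

query = """
SELECT t1.CustomerName, t2.Category AS Category1,
       t3.Category AS Category2, t4.Category AS Category3
FROM (SELECT DISTINCT CustomerName FROM DataFrame) AS t1
INNER JOIN DataFrame AS t2 ON t1.CustomerName = t2.CustomerName
INNER JOIN DataFrame AS t3 ON t1.CustomerName = t3.CustomerName
INNER JOIN DataFrame AS t4 ON t1.CustomerName = t4.CustomerName
WHERE (t2.OrderSequence=1) AND (t3.OrderSequence=2) AND (t4.OrderSequence=3);
"""

conn = sqlite3.connect(':memory:')
# Write the sample rows to a table named DataFrame, matching the query.
df.to_sql('DataFrame', conn, index=False)
result = pd.read_sql_query(query, conn)
conn.close()

print(result)
```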
+0

Thanks, I will try running your logic and see whether it runs faster on my data! – Tanya

+0

So what did you find on your actual data? – Parfait