Counting the frequency of unique combinations in market baskets

I have a set of 1,000,000 market baskets, each containing 1-4 items. I want to count how often each unique combination was purchased.

My data is organized like this:

[in] print(training_df.head(n=5)) 

[out]      product_id 
transaction_id      
0000001     [P06, P09] 
0000002   [P01, P05, P06, P09] 
0000003     [P01, P06] 
0000004     [P01, P09] 
0000005     [P06, P09]

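In this example, [P06, P09] has a frequency of 2 and every other combination has a frequency of 1. I have already created a binary matrix and calculated the frequency of each individual item like this: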

# Create a binary (one-hot) matrix for the transactions
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

product_ids = ['P{:02d}'.format(i + 1) for i in range(10)]

mlb = MultiLabelBinarizer(classes=product_ids)
# Use the columns= keyword: a positional axis argument to drop() fails in recent pandas
training_df1 = training_df.drop(columns='product_id').join(
    pd.DataFrame(mlb.fit_transform(training_df['product_id']),
                 columns=mlb.classes_,
                 index=training_df.index))

# Calculate the support count for each product (frequency)
train_product_support = {}
for column in training_df1.columns:
    train_product_support[column] = sum(training_df1[column] > 0)

How can I calculate the frequency of every unique combination of 1-4 items that appears in the data?

Source

2017-08-01 zsad512

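Well, since you can't use df.groupby('product_id').count(), this is the best I could come up with. We use the string representation of each list as a dictionary key and count its occurrences.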

counts = dict()
for i in df['product_id']:
    key = repr(i)  # string form of the list, e.g. "['P06', 'P09']"
    if key in counts:
        counts[key] += 1
    else:
        counts[key] = 1

Source

2017-08-01 20:17:38 jacoblaw

This is how I would solve it too, but I'd guess that order doesn't matter. So I would throw in 'key = sorted(key)' to catch any permutation of the same items –
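A minimal sketch of that order-insensitive idea, assuming the sort is applied to each basket list before it becomes a key (calling sorted() on the string returned by repr() would sort its characters instead). The `baskets` sample is hypothetical, not data from the question.

```python
from collections import Counter

# Hypothetical sample: the second basket is a permutation of the first.
baskets = [["P06", "P09"], ["P09", "P06"], ["P01", "P06"]]

# Sort each list, then convert it to a tuple so it is hashable
# and can serve as a dictionary key.
counts = Counter(tuple(sorted(basket)) for basket in baskets)

for combo, n in counts.items():
    print(combo, n)
```

With this key, both orderings of P06 and P09 land in the same bucket.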

A 'defaultdict' might be a better fit: https://docs.python.org/3/library/collections.html collections.defaultdict – dashiell
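As a rough illustration of the defaultdict suggestion, the counting loop could look like the sketch below. The `baskets` list just mirrors the five sample rows from the question; with a real DataFrame you would iterate over df['product_id'] instead.

```python
from collections import defaultdict

# Sample baskets copied from the question's first five rows.
baskets = [
    ["P06", "P09"],
    ["P01", "P05", "P06", "P09"],
    ["P01", "P06"],
    ["P01", "P09"],
    ["P06", "P09"],
]

# defaultdict(int) supplies 0 for unseen keys, so no if/else is needed.
counts = defaultdict(int)
for basket in baskets:
    counts[repr(basket)] += 1
```

The key is still the string form of the list, so this version remains order-sensitive.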

You may also want a 'frozenset' rather than a 'str' as the key: https://docs.python.org/3/library/stdtypes.html#frozenset – dashiell

Perhaps:

df['frozensets'] = df.apply(lambda row: frozenset(row.product_id),axis=1) 
df['frozensets'].value_counts()

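This creates frozensets from the product_id column (they are hashable and ignore ordering), then counts the number of occurrences of each unique value.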
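For anyone who wants to try this end to end, here is a self-contained sketch under a few assumptions: the DataFrame is rebuilt by hand from the question's sample rows, the last basket is deliberately reversed to show that ordering is ignored, and `df['product_id'].apply(frozenset)` is used as a shorter variant of the row-wise lambda above.

```python
import pandas as pd

# Hand-built stand-in for training_df; the last basket is reversed on purpose.
df = pd.DataFrame(
    {
        "product_id": [
            ["P06", "P09"],
            ["P01", "P05", "P06", "P09"],
            ["P01", "P06"],
            ["P01", "P09"],
            ["P09", "P06"],
        ]
    },
    index=pd.Index(
        ["0000001", "0000002", "0000003", "0000004", "0000005"],
        name="transaction_id",
    ),
)

# Convert each list to a frozenset (hashable, unordered), then count unique values.
combo_counts = df["product_id"].apply(frozenset).value_counts()
print(combo_counts)
```

Here {P06, P09} should come out with a count of 2, since both orderings map to the same frozenset.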

Source

2017-08-01 20:42:51 dashiell

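This sorts the data from highest to lowest frequency (with unique combinations). How can I further sort the unique combinations by the number of items in each combination? – zsad512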
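One possible way to do that, offered as a sketch and not a confirmed answer from the thread: put the value_counts() result into a small table with an item-count column and sort on both columns. The `summary` frame and its column names (`combination`, `n_items`, `count`) are made up for illustration.

```python
import pandas as pd

# Sample baskets copied from the question's first five rows.
baskets = pd.Series(
    [
        ["P06", "P09"],
        ["P01", "P05", "P06", "P09"],
        ["P01", "P06"],
        ["P01", "P09"],
        ["P06", "P09"],
    ]
)

combo_counts = baskets.apply(frozenset).value_counts()

# Hypothetical summary table: one row per unique combination.
summary = pd.DataFrame(
    {
        "combination": [tuple(sorted(c)) for c in combo_counts.index],
        "n_items": [len(c) for c in combo_counts.index],
        "count": combo_counts.to_numpy(),
    }
)

# Smallest combinations first; within each size, most frequent first.
summary = summary.sort_values(
    ["n_items", "count"], ascending=[True, False]
).reset_index(drop=True)
print(summary)
```

Swapping the order of the sort keys would instead rank by frequency first and use combination size only to break ties.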
