Converting a pandas DataFrame to a sparse key-item matrix using a composite key

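I have a DataFrame with 3 columns. Column 1 is a string order number, column 2 is an integer day, and column 3 is a product name. I would like to convert this into a matrix where each row represents a unique order/day combination and each column holds a 1/0 indicating whether that product name is present for that combination.

My approach so far uses a product dictionary and a dictionary keyed on the order # & day combination. The final step is to iterate through the original DataFrame and flip the corresponding bits in the matrix to 1. This takes about 10 minutes for a matrix of size 363K × 331 with a sparsity of ~97%.

Should I consider a different approach?

For example,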

ord_nb  day  prod
1       1    A
1       1    B
1       2    B
1       2    C
1       2    D

would become

A  B  C  D
1  1  0  0
0  1  1  1

My approach is to create a dictionary of order/day pairs:

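ord_day_dict = {}
print("Making a dictionary of ord-by-day keys...")
gp = df.groupby(['day', 'ord_nb'])  # column is 'ord_nb', as in the sample data
for i, g in enumerate(gp.groups.items()):
    # g[0] is the (day, ord_nb) group key; map it to a running row index
    ord_day_dict[g[0][0], g[0][1]] = i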

I append this index representation to the original DataFrame:

df['ord_day_idx'] = 0  # create a placeholder column
for i, row in df.iterrows():  # populate the column with the index
    # df.at replaces df.set_value, which was removed in later pandas versions
    df.at[i, 'ord_day_idx'] = ord_day_dict[(row['day'], row['ord_nb'])]

I then initialize a matrix whose size is the number of ord/day combinations × the number of unique products:

n_items = df['prod'].unique().shape[0]  # number of unique products
n_ord_days = len(ord_day_dict)  # number of unique ord-by-day combos
df_fac_matrix = np.zeros((n_ord_days, n_items), dtype=np.float64)

I map my product strings to indices via a dictionary:

prod_dict = dict()
i = 0
for v in df['prod']:  # df.prod would resolve to the DataFrame.prod method
    if v not in prod_dict:
        prod_dict[v] = i
        i = i + 1

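Finally, I iterate through the original DataFrame to fill the matrix with 1s wherever a particular order on a particular day includes a particular product.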

for line in df.itertuples():
    # Row: the order-by-day index (line[4]); column: the product index,
    # looked up from the product name (line[3]) in prod_dict. Mark a 1.
    df_fac_matrix[line[4], prod_dict[line[3]]] = 1.0

Source

2016-11-16 Amw 5G

Here's a NumPy-based approach that produces an array as output -

a = df[['ord_nb','day']].values.astype(int) 
row = np.unique(np.ravel_multi_index(a.T,a.max(0)+1),return_inverse=1)[1] 
col = np.unique(df.prd.values,return_inverse=1)[1] 
out_shp = row.max()+1, col.max()+1 
out = np.zeros(out_shp, dtype=int) 
out[row,col] = 1

Note that the third column is assumed to be named 'prd' rather than 'prod', to avoid a name clash with the built-in.
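For readers who want to run the snippet above end to end, here is a minimal sketch on the sample data from the question. It assumes the order numbers can be cast to non-negative integers (the question describes them as strings), and the names `linear`, `row_keys`, `col_names` and `row_labels` are introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question; the product column is named 'prd' here.
df = pd.DataFrame({
    "ord_nb": [1, 1, 1, 1, 1],
    "day":    [1, 1, 2, 2, 2],
    "prd":    ["A", "B", "B", "C", "D"],
})

# Composite key columns as an integer array of shape (n_rows, 2).
a = df[["ord_nb", "day"]].to_numpy().astype(int)

# Collapse each (ord_nb, day) pair into one linear index, then relabel the
# distinct linear indices as consecutive row numbers.
linear = np.ravel_multi_index(a.T, a.max(axis=0) + 1)
row_keys, row = np.unique(linear, return_inverse=True)

# Relabel the distinct product names as consecutive column numbers.
col_names, col = np.unique(np.array(df["prd"].tolist()), return_inverse=True)

out = np.zeros((row_keys.size, col_names.size), dtype=int)
out[row, col] = 1

# Recover the (ord_nb, day) pair behind each output row.
row_labels = np.column_stack(np.unravel_index(row_keys, a.max(axis=0) + 1))
```

Keeping `row_labels` and `col_names` around makes it possible to tell which order/day pair and which product each cell of `out` refers to.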

Possible improvements, with a focus on performance -

If prd holds only single-letter values starting from A, we could compute col simply with: df.prd.values.astype('S1').view('uint8')-65.
Alternatively, we could compute row with np.unique(a[:,0]*(a[:,1].max()+1) + a[:,1],return_inverse=1)[1].
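The two shortcuts above can be checked against the `np.unique`-based labels on the sample data. This is only a sketch: the names `row_ref`, `col_ref`, `row_fast` and `col_fast` are made up for the comparison, and the byte array is built from a Python list rather than from `df.prd.values` to keep the string-to-bytes conversion explicit.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ord_nb": [1, 1, 1, 1, 1],
    "day":    [1, 1, 2, 2, 2],
    "prd":    ["A", "B", "B", "C", "D"],
})
a = df[["ord_nb", "day"]].to_numpy().astype(int)

# Reference labels, computed as in the main snippet.
row_ref = np.unique(np.ravel_multi_index(a.T, a.max(axis=0) + 1),
                    return_inverse=True)[1]
col_ref = np.unique(np.array(df["prd"].tolist()), return_inverse=True)[1]

# Column shortcut: read each one-character name as its ASCII code and
# subtract 65 (the code of 'A'), so 'A' -> 0, 'B' -> 1, and so on.
col_fast = np.array(df["prd"].tolist(), dtype="S1").view("uint8") - 65

# Row shortcut: build the linear index by hand instead of calling
# np.ravel_multi_index.
row_fast = np.unique(a[:, 0] * (a[:, 1].max() + 1) + a[:, 1],
                     return_inverse=True)[1]
```

One difference worth keeping in mind: the ASCII shortcut labels a product by its offset from 'A', so a letter that never occurs still reserves an (all-zero) column, whereas `np.unique` numbers only the products that actually appear.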

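Saving memory with a sparse array: for really huge arrays, we could save memory by storing them as sparse matrices. The final steps to get such a sparse matrix would be -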

from scipy.sparse import coo_matrix 

d = np.ones(row.size,dtype=int) 
out_sparse = coo_matrix((d,(row,col)), shape=out_shp)
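One thing to watch with the COO construction: if the same order/day/product row occurs more than once in the data, SciPy sums the duplicate entries when the matrix is converted to CSR or to a dense array, so a cell can end up as 2 instead of 1. The sketch below uses hypothetical `row` and `col` labels with one deliberate duplicate to show this, along with one possible way to get back to a 0/1 matrix.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical labels; the pair (0, 0) appears twice on purpose.
row = np.array([0, 0, 0, 1, 1, 1])
col = np.array([0, 0, 1, 1, 2, 3])
out_shp = (row.max() + 1, col.max() + 1)

d = np.ones(row.size, dtype=int)
out_sparse = coo_matrix((d, (row, col)), shape=out_shp)

# Duplicate positions are summed on conversion, so cell (0, 0) holds 2.
counts = out_sparse.toarray()

# Converting to CSR merges the duplicates into one stored element each;
# resetting the stored values to 1 then gives a presence/absence matrix.
out_csr = out_sparse.tocsr()
out_csr.data[:] = 1
binary = out_csr.toarray()
```

The dense version with `out[row,col] = 1` does not have this issue, since assigning 1 twice to the same cell still leaves a 1.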

Sample input, output -

In [232]: df
Out[232]:
   ord_nb  day prd
0       1    1   A
1       1    1   B
2       1    2   B
3       1    2   C
4       1    2   D

In [233]: out
Out[233]:
array([[1, 1, 0, 0],
       [0, 1, 1, 1]])

In [241]: out_sparse
Out[241]:
<2x4 sparse matrix of type '<type 'numpy.int64'>'
    with 5 stored elements in COOrdinate format>

In [242]: out_sparse.toarray()
Out[242]:
array([[1, 1, 0, 0],
       [0, 1, 1, 1]])

Source

2016-11-16 21:11:55 Divakar

Here is an option you can try:

df.groupby(['ord_nb', 'day'])['prod'].apply(list).apply(lambda x: pd.Series(1, x)).fillna(0) 

#              A    B    C    D
# ord_nb day
# 1      1    1.0  1.0  0.0  0.0
#        2    0.0  1.0  1.0  1.0
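As a runnable sketch of the one-liner above on the question's sample data, the chain below adds a final `astype(int)` so the result holds integer 1/0 values rather than floats. The name `wide` is made up for illustration, and the sketch assumes no product is listed twice within the same order/day group.

```python
import pandas as pd

df = pd.DataFrame({
    "ord_nb": [1, 1, 1, 1, 1],
    "day":    [1, 1, 2, 2, 2],
    "prod":   ["A", "B", "B", "C", "D"],
})

wide = (
    df.groupby(["ord_nb", "day"])["prod"]
      .apply(list)                                     # one product list per group
      .apply(lambda items: pd.Series(1, index=items))  # list -> row of 1s
      .fillna(0)                                       # products absent from a group
      .astype(int)
)
```

The result is a DataFrame indexed by the (ord_nb, day) pairs, so the row and column labels stay attached to the matrix.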

Source

2016-11-16 20:21:12 Psidom
