2016-04-03

How to factorize two DataFrames in python-pandas?

I have two DataFrames: one holds user-item ratings, and the other holds side information about those items:

#df1 
A12VH45Q3H5R5I B000NWJTKW 5.0 
A3J8AQWNNI3WSN B000NWJTKW 4.0 
A1XOBWIL4MILVM BDASK99000 1.0 

#df2 
B000NWJTKW .... 
BDASK99000 .... 

Now I'd like to map the item and user names to integer IDs. I know this can be done with factorize:

df.apply(lambda x: pd.factorize(x)[0] + 1) 

But I want to make sure the item integers are consistent across both DataFrames, so that the resulting DataFrames are:

#df1 
1  1  5.0 
2  1  4.0 
3  2  1.0 

#df2 
1  ... 
2  ... 

Do you know how to ensure this? Thanks in advance!

Answer

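Concatenate the common column(s) and apply pd.factorize (or pd.Categorical) to the result. Because both DataFrames are encoded in a single pass, the same item always gets the same integer: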

codes, uniques = pd.factorize(pd.concat([df1['item'], df2['item']])) 
df1['item'] = codes[:len(df1)] + 1 
df2['item'] = codes[len(df1):] + 1 

For example,

import pandas as pd 

df1 = pd.DataFrame(
[('A12VH45Q3H5R5I', 'B000NWJTKW', 5.0), 
('A3J8AQWNNI3WSN', 'B000NWJTKW', 4.0), 
('A1XOBWIL4MILVM', 'BDASK99000', 1.0)], columns=['user', 'item', 'rating']) 

df2 = pd.DataFrame(
[('B000NWJTKW', 10), 
('BDASK99000', 20)], columns=['item', 'extra']) 

codes, uniques = pd.factorize(pd.concat([df1['item'], df2['item']])) 
df1['item'] = codes[:len(df1)] + 1 
df2['item'] = codes[len(df1):] + 1 

codes, uniques = pd.factorize(df1['user']) 
df1['user'] = codes + 1 

print(df1) 
print(df2) 

yields

# df1
   user  item  rating
0     1     1     5.0
1     2     1     4.0
2     3     2     1.0

# df2
   item  extra
0     1     10
1     2     20
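
The pd.Categorical route mentioned above could look roughly like the following. This is only a sketch, assuming the same column names as in the example; the `items` variable is introduced here for illustration. The idea is to build one shared list of categories from both DataFrames and encode each column against it:

```python
import pandas as pd

df1 = pd.DataFrame(
    [('A12VH45Q3H5R5I', 'B000NWJTKW', 5.0),
     ('A3J8AQWNNI3WSN', 'B000NWJTKW', 4.0),
     ('A1XOBWIL4MILVM', 'BDASK99000', 1.0)], columns=['user', 'item', 'rating'])

df2 = pd.DataFrame(
    [('B000NWJTKW', 10),
     ('BDASK99000', 20)], columns=['item', 'extra'])

# Shared vocabulary: every item label seen in either frame,
# in order of first appearance (`items` is an illustrative name).
items = pd.concat([df1['item'], df2['item']]).unique()

# Encode both columns against the same categories so the codes agree.
# .codes is 0-based, hence the + 1 to match the 1-based IDs used above.
df1['item'] = pd.Categorical(df1['item'], categories=items).codes + 1
df2['item'] = pd.Categorical(df2['item'], categories=items).codes + 1

print(df1)
print(df2)
```

One possible advantage of this variant is that `items` stays around as a lookup table: `items[code - 1]` gives back the original label for a given integer ID.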

Another way to work around the problem (if you have enough memory) is to merge the two DataFrames, df3 = pd.merge(df1, df2, on='item', how='outer'), and then factorize df3['item']:

df3 = pd.merge(df1, df2, on='item', how='outer') 
for col in ['item', 'user']: 
    df3[col] = pd.factorize(df3[col])[0] + 1 
print(df3) 

yields

   user  item  rating  extra
0     1     1     5.0     10
1     2     1     4.0     10
2     3     2     1.0     20
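
One thing to watch for with the merge approach: if df2 contains an item that no user has rated, the outer merge produces a row whose user is NaN. By default pd.factorize gives missing values the code -1, so after adding 1 that row ends up with user ID 0. A small sketch of this case, where 'B00UNRATED0' is a made-up item ID used only for illustration:

```python
import pandas as pd

df1 = pd.DataFrame(
    [('A12VH45Q3H5R5I', 'B000NWJTKW', 5.0),
     ('A3J8AQWNNI3WSN', 'B000NWJTKW', 4.0)], columns=['user', 'item', 'rating'])

# 'B00UNRATED0' is a hypothetical item with side information but no ratings.
df2 = pd.DataFrame(
    [('B000NWJTKW', 10),
     ('B00UNRATED0', 30)], columns=['item', 'extra'])

df3 = pd.merge(df1, df2, on='item', how='outer')

# The unrated item has no matching user, so its 'user' cell is NaN.
# pd.factorize codes NaN as -1, which becomes 0 after the + 1 shift.
for col in ['item', 'user']:
    df3[col] = pd.factorize(df3[col])[0] + 1

print(df3)
```

If an ID of 0 for "no user" is not wanted, such rows can be filtered out afterwards, for example with df3[df3['user'] > 0].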