2017-09-26

Pandas: normalize weights within groups

Say we have the following dataset:

import pandas as pd 

data = [('apple', 'red', 155), ('apple', 'green', 102), ('apple', 'iphone', 48), 
     ('tomato', 'red', 175), ('tomato', 'ketchup', 96), ('tomato', 'gun', 12)] 

df = pd.DataFrame(data) 
df.columns = ['word', 'rel_word', 'weight'] 


I'd like to recompute the weights so that they sum to 1.0 within each group (apple, tomato in the example) while preserving the relative weights (e.g. apple/red to apple/green should still be 155/102).

Can you add the desired output? – jezrael

Please mention the expected output in a separate column for better understanding – JKC

Answers

2

You can use groupby to compute each group's total weight, then apply a normalizing lambda function to each row:

group_weights = df.groupby('word')['weight'].sum() 
df['normalized_weights'] = df.apply(lambda row: row['weight']/group_weights[row['word']], axis=1) 

Output:

  word rel_word weight normalized_weights 
0 apple  red   155 0.508197 
1 apple  green  102 0.334426 
2 apple  iphone  48 0.157377 
3 tomato  red   175 0.618375 
4 tomato ketchup  96 0.339223 
5 tomato  gun   12 0.042403 
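As a quick sanity check (my addition, not part of the original answer), the normalized weights should sum to 1.0 within each group. A minimal self-contained sketch:

```python
import pandas as pd

data = [('apple', 'red', 155), ('apple', 'green', 102), ('apple', 'iphone', 48),
        ('tomato', 'red', 175), ('tomato', 'ketchup', 96), ('tomato', 'gun', 12)]
df = pd.DataFrame(data, columns=['word', 'rel_word', 'weight'])

# per-group totals, then row-wise division as in the answer above
group_weights = df.groupby('word')['weight'].sum()
df['normalized_weights'] = df.apply(
    lambda row: row['weight'] / group_weights[row['word']], axis=1)

# each group's normalized weights should sum to 1.0
print(df.groupby('word')['normalized_weights'].sum())
```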
Nice solution that wraps imperative programming into Pandas thinking. Thanks! –

2

Use transform - it is faster than apply plus a lookup:

In [3849]: df['weight']/df.groupby('word')['weight'].transform('sum') 
Out[3849]: 
0 0.508197 
1 0.334426 
2 0.157377 
3 0.618375 
4 0.339223 
5 0.042403 
Name: weight, dtype: float64 

In [3850]: df['norm_w'] = df['weight']/df.groupby('word')['weight'].transform('sum') 

In [3851]: df 
Out[3851]: 
    word rel_word weight norm_w 
0 apple  red  155 0.508197 
1 apple green  102 0.334426 
2 apple iphone  48 0.157377 
3 tomato  red  175 0.618375 
4 tomato ketchup  96 0.339223 
5 tomato  gun  12 0.042403 

Or,

In [3852]: df.groupby('word')['weight'].transform(lambda x: x/x.sum()) 
Out[3852]: 
0 0.508197 
1 0.334426 
2 0.157377 
3 0.618375 
4 0.339223 
5 0.042403 
Name: weight, dtype: float64 
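To see why this works (my illustration, not part of the original answer): `transform('sum')` returns a Series aligned to the original index, with each group's total repeated on every row of that group, so the element-wise division lines up automatically:

```python
import pandas as pd

data = [('apple', 'red', 155), ('apple', 'green', 102), ('apple', 'iphone', 48),
        ('tomato', 'red', 175), ('tomato', 'ketchup', 96), ('tomato', 'gun', 12)]
df = pd.DataFrame(data, columns=['word', 'rel_word', 'weight'])

# group totals broadcast back to the original row order
totals = df.groupby('word')['weight'].transform('sum')
print(totals.tolist())  # [305, 305, 305, 283, 283, 283]
```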

Timings

In [3862]: df.shape 
Out[3862]: (12000, 4) 

In [3864]: %timeit df['weight']/df.groupby('word')['weight'].transform('sum') 
100 loops, best of 3: 2.44 ms per loop 

In [3866]: %timeit df.groupby('word')['weight'].transform(lambda x: x/x.sum()) 
100 loops, best of 3: 5.16 ms per loop 

In [3868]: %%timeit 
     ...: group_weights = df.groupby('word')['weight'].sum() 
     ...: df.apply(lambda row: row['weight']/group_weights[row['word']], axis=1) 
1 loop, best of 3: 2.5 s per loop 
Looks like a smarter and more pandas-fu way. Thanks! –

0

Use np.bincount & pd.factorize
This should be very fast and scalable

import numpy as np 

f, u = pd.factorize(df.word.values) 
w = df.weight.values 

df.assign(norm_w=w/np.bincount(f, w)[f]) 

    word rel_word weight norm_w 
0 apple  red  155 0.508197 
1 apple green  102 0.334426 
2 apple iphone  48 0.157377 
3 tomato  red  175 0.618375 
4 tomato ketchup  96 0.339223 
5 tomato  gun  12 0.042403
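How the factorize/bincount trick works (my walkthrough, not part of the original answer): `pd.factorize` maps each label to an integer code in order of first appearance, `np.bincount` with weights sums the weights per code, and indexing the sums with the codes broadcasts each group total back to row order:

```python
import numpy as np
import pandas as pd

words = pd.Series(['apple', 'apple', 'apple', 'tomato', 'tomato', 'tomato'])
weights = np.array([155., 102., 48., 175., 96., 12.])

# integer codes per label: apple -> 0, tomato -> 1
f, u = pd.factorize(words.values)
print(f)           # [0 0 0 1 1 1]

# weighted bincount: sum of weights for each code
sums = np.bincount(f, weights)
print(sums)        # [305. 283.]

# fancy indexing broadcasts each group's total back to row order
print(sums[f])     # [305. 305. 305. 283. 283. 283.]
```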