2016-08-30 331 views
4

How to group by word and aggregate in a pandas DataFrame

I want to aggregate a pandas DataFrame by word.

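Basically there are three columns: a click count, an impression count, and the corresponding phrase. I want to split each phrase into tokens and then sum the clicks per token, to determine which tokens perform relatively well or badly.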

Expected input: a pandas DataFrame like the following

   click_count  impression_count  text
1           10               100  pizza
2           20               200  pizza italian
3            1                 1  italian cheese

Expected output:

   click_count  impression_count  token
1           30               300  pizza    // 30 = 20 + 10, 300 = 200 + 100
2           21               201  italian  // 21 = 20 + 1, 201 = 200 + 1
3            1                 1  cheese   // cheese appeared only once, in "italian cheese"
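
To make the examples reproducible, the input frame above can be rebuilt roughly like this. The index labels 1 to 3 are an assumption read off the printed table; the column names come straight from the question.

```python
import pandas as pd

# Sample data copied from the question; index 1..3 is assumed from the printout.
df = pd.DataFrame(
    {
        "click_count": [10, 20, 1],
        "impression_count": [100, 200, 1],
        "text": ["pizza", "pizza italian", "italian cheese"],
    },
    index=[1, 2, 3],
)
print(df)
```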

Answers

1
tokens = df.text.str.split(expand=True) 
token_cols = ['token_{}'.format(i) for i in range(tokens.shape[1])] 
tokens.columns = token_cols 

df1 = pd.concat([df.drop('text', axis=1), tokens], axis=1) 
df1 


df2 = pd.lreshape(df1, {'tokens': token_cols}) 
df2 


df2.groupby('tokens').sum() 

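
For reference, the same wide-to-long idea can be sketched end to end with `DataFrame.melt` swapped in for `pd.lreshape` (this is a substitution, not what the answer used). The sample frame is rebuilt from the question, and the names `wide`, `long_df` and `result` are only illustrative.

```python
import pandas as pd

# Sample data from the question (index 1..3 assumed from the printout).
df = pd.DataFrame(
    {
        "click_count": [10, 20, 1],
        "impression_count": [100, 200, 1],
        "text": ["pizza", "pizza italian", "italian cheese"],
    },
    index=[1, 2, 3],
)

# One column per token position: token_0, token_1, ...
tokens = df["text"].str.split(expand=True)
tokens.columns = ["token_{}".format(i) for i in range(tokens.shape[1])]
wide = pd.concat([df.drop("text", axis=1), tokens], axis=1)

# Wide to long: one row per (phrase, token position); drop the empty slots.
long_df = wide.melt(
    id_vars=["click_count", "impression_count"],
    value_vars=list(tokens.columns),
    value_name="tokens",
).dropna(subset=["tokens"])

result = long_df.groupby("tokens")[["click_count", "impression_count"]].sum()
print(result)
```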

1

This creates a new DataFrame like piRSquared's, but with the tokens stacked and merged with the original:

(df['text'].str.split(expand=True).stack().reset_index(level=1, drop=True) 
      .to_frame('token').merge(df, left_index=True, right_index=True) 
      .groupby('token')[['click_count', 'impression_count']].sum()) 
Out: 
     click_count impression_count 
token         
cheese    1     1 
italian   21    201 
pizza    30    300 

If you break it down, it combines this:

df['text'].str.split(expand=True).stack().reset_index(level=1, drop=True).to_frame('token') 
Out: 
    token 
1 pizza 
2 pizza 
2 italian 
3 italian 
3 cheese 

with the original DataFrame, joined on their indices. The resulting DataFrame is:

(df['text'].str.split(expand=True).stack().reset_index(level=1, drop=True) 
      .to_frame('token').merge(df, left_index=True, right_index=True)) 
Out: 
    token click_count impression_count   text 
1 pizza   10    100   pizza 
2 pizza   20    200 pizza italian 
2 italian   20    200 pizza italian 
3 italian   1     1 italian cheese 
3 cheese   1     1 italian cheese 

The rest is just grouping by the token column.
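
On pandas 0.25 or later, a shorter route to the same result is `DataFrame.explode`. This is a different technique from the stack-and-merge above, shown only as a sketch; the sample frame is rebuilt from the question and `result` is an illustrative name.

```python
import pandas as pd

# Sample data from the question (index 1..3 assumed from the printout).
df = pd.DataFrame(
    {
        "click_count": [10, 20, 1],
        "impression_count": [100, 200, 1],
        "text": ["pizza", "pizza italian", "italian cheese"],
    },
    index=[1, 2, 3],
)

result = (
    df.assign(token=df["text"].str.split())  # each cell becomes a list of tokens
      .explode("token")                      # one row per token, counts repeated
      .groupby("token")[["click_count", "impression_count"]]
      .sum()
)
print(result)
```

Note that a phrase containing the same word twice would contribute its counts twice with any of these approaches.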

0

You can do

In [3091]: s = df.text.str.split(expand=True).stack().reset_index(drop=True, level=-1) 

In [3092]: df.loc[s.index].assign(token=s).groupby('token',sort=False,as_index=False).sum(numeric_only=True) 
Out[3092]: 
    token click_count impression_count 
0 pizza   30    300 
1 italian   21    201 
2 cheese   1     1 

Details

In [3093]: df 
Out[3093]: 
    click_count impression_count   text 
1   10    100   pizza 
2   20    200 pizza italian 
3   1     1 italian cheese 

In [3094]: s 
Out[3094]: 
1  pizza 
2  pizza 
2 italian 
3 italian 
3  cheese 
dtype: object
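
The step that makes this work is `df.loc[s.index]`: because `s` repeats each row label once per token, the lookup repeats the matching rows of `df`. A minimal sketch of that intermediate, with the sample frame rebuilt from the question; the names `repeated` and `labelled` are illustrative, and the `.dropna()` is an added safeguard because whether `stack()` discards missing values depends on the pandas version.

```python
import pandas as pd

# Sample data from the question (index 1..3 assumed from the printout).
df = pd.DataFrame(
    {
        "click_count": [10, 20, 1],
        "impression_count": [100, 200, 1],
        "text": ["pizza", "pizza italian", "italian cheese"],
    },
    index=[1, 2, 3],
)

# One entry per token, indexed by the row label it came from.
s = (
    df["text"].str.split(expand=True)
      .stack()
      .dropna()
      .reset_index(level=-1, drop=True)
)

repeated = df.loc[s.index]                      # rows repeated once per token
labelled = repeated.assign(token=s.to_numpy())  # attach tokens positionally

result = (
    labelled.groupby("token", sort=False, as_index=False)
            [["click_count", "impression_count"]]
            .sum()
)
print(result)
```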