2015-11-06

Generating columns of top-ranked values in pandas

I have a DataFrame, topic_data, containing the output of an LDA topic model:

topic_data.head(15) 

    topic                      word     score
0       0                Automobile  0.063986
1       0                   Vehicle  0.017457
2       0                Horsepower  0.015675
3       0                    Engine  0.014857
4       0                   Bicycle  0.013919
5       1                     Sport  0.032938
6       1      Association_football  0.025324
7       1                Basketball  0.020949
8       1                  Baseball  0.016935
9       1  National_Football_League  0.016597
10      2                     Japan  0.051454
11      2                      Beer  0.032839
12      2                   Alcohol  0.027909
13      2                     Drink  0.019494
14      2                     Vodka  0.017908

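This shows the top 5 terms for each topic, along with the score (weight) of each. What I want to do is reshape it so that the index is the rank of the term, the columns are the topic IDs, and the values are formatted strings built from the word and score columns (something along the lines of "%s (%.02f)" % (word, score)). That means the new DataFrame should look something like this: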

Topic  0                  1                            ...
Rank
0      Automobile (0.06)  Sport (0.03)                 ...
1      Vehicle (0.02)     Association_football (0.03)  ...
...    ...                ...                          ...

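What is the right way to go about this? I assume it involves some combination of setting the index, unstacking, and ranking, but I'm not sure of the correct approach.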

Answers

It would be something like this; note that Rank has to be generated first:

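In [140]:
df = topic_data.copy()
# Rank terms within each topic by descending score (0 = top term)
df['Rank'] = df.groupby('topic')['score'].rank(method='first', ascending=False).astype(int) - 1
# Build labels such as "Automobile (0.06)"
df['New_str'] = df['word'] + df['score'].apply(' ({0:.2f})'.format)
df2 = df.sort_values(['Rank', 'score'])[['New_str', 'topic', 'Rank']]
print(df2.pivot(index='Rank', columns='topic', values='New_str'))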

topic                  0                                1               2
Rank
0      Automobile (0.06)                     Sport (0.03)    Japan (0.05)
1         Vehicle (0.02)      Association_football (0.03)     Beer (0.03)
2      Horsepower (0.02)                Basketball (0.02)  Alcohol (0.03)
3          Engine (0.01)                  Baseball (0.02)    Drink (0.02)
4         Bicycle (0.01)  National_Football_League (0.02)    Vodka (0.02)
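
The question guessed that setting the index and unstacking might also work. Here is a minimal sketch of that route, assuming a reasonably recent pandas. The helper name `top_terms_table` and the `label` column are made up for illustration, and the sample frame holds only the first three rows of topics 0 and 1 from the question. Rank is taken as the position of each row within its topic after sorting by descending score.

```python
import pandas as pd

# Sample rows copied from the question's topic_data (topics 0 and 1 only).
topic_data = pd.DataFrame({
    "topic": [0, 0, 0, 1, 1, 1],
    "word": ["Automobile", "Vehicle", "Horsepower",
             "Sport", "Association_football", "Basketball"],
    "score": [0.063986, 0.017457, 0.015675,
              0.032938, 0.025324, 0.020949],
})


def top_terms_table(df):
    """Return a Rank x topic table of 'word (score)' strings.

    Hypothetical helper, not part of pandas.
    """
    # Within each topic, put the highest score first.
    ordered = df.sort_values(["topic", "score"], ascending=[True, False])
    out = ordered.assign(
        # cumcount() numbers the rows of each topic 0, 1, 2, ... in that order.
        Rank=ordered.groupby("topic").cumcount(),
        # Same formatting idea as "%s (%.02f)" in the question.
        label=ordered["word"] + ordered["score"].map(" ({:.2f})".format),
    )
    # Index by (Rank, topic), then move topic into the columns.
    return out.set_index(["Rank", "topic"])["label"].unstack("topic")


table = top_terms_table(topic_data)
print(table)
```

Because the rank comes from an explicit sort, this does not depend on the input rows already being ordered by score. If some topics have fewer terms than others, the missing cells come out as NaN.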