2017-06-07

N-gram analysis in Python

Here is what my sample data looks like:

[image: sample data]

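I need to run a 1-2 gram analysis on the queries and compute the sum and the average of the impressions associated with each n-gram. So far I have worked out how to total the impressions, using the code below.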

import pandas as pd

def n_grams(txt):
    # Return every contiguous word n-gram in txt (all lengths) as a Series.
    grams = list()
    words = txt.split(' ')
    for i in range(len(words)):
        for k in range(1, len(words) - i + 1):
            grams.append(" ".join(words[i:i+k]))
    return pd.Series(grams)


counts = df['query'].apply(n_grams).join(df)
result = (
    counts.drop("query", axis=1)
    .set_index("impression")
    .unstack()
    .rename("ngram")
    .dropna()
    .reset_index()
    .drop("level_0", axis=1)
    .groupby("ngram")["impression"]
    .sum()
)
result = result.to_frame()
result['query'] = result.index
result['ngram'] = result['query'].str.split().apply(len)
result = result.groupby(['ngram', 'query'])['impression'].sum()
result = result.reset_index()
result = result.sort_values(['ngram', 'impression'], ascending=[True, False])

It returns a result like this:

[image: current result]

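Here I need one more column showing the average impressions associated with each of these queries. For example, the word "nutrition" appears four times, so its average impressions should be 100/4 = 25. I would also like another column showing how many times each query appears. The final result should look like this:

[image: desired result]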

Answer


This code will help you count how many times a unigram such as 'nutrition' appears within the bigrams.

# Keep only the bigram rows; reset the index so rows can be addressed as 0..n-1
bigrams = result[result['ngram'] == 2]
bigrams = bigrams.reset_index()
# Create an empty dictionary to store the count of each word in the bigrams
words = dict()
for i in range(0, len(bigrams)):
    query_words = bigrams.loc[i, 'query'].split()
    for item in query_words:
        if item not in words:
            words[item] = 1
        else:
            words[item] += 1
# Get the count of the word 'nutrition'
nut_ct = words['nutrition']
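
If the goal is the full table from the question (sum, number of occurrences, and average impressions per 1-2 gram), one possible approach is to collect the total and the count in a single pass and divide at the end. The following is only a rough sketch in plain Python, not the asker's or answerer's method. The helper names `word_ngrams` and `ngram_stats`, the `count` and `mean` keys, and the sample rows are all made up for illustration, since the real data is only shown as an image.

```python
from collections import defaultdict

def word_ngrams(text, max_n=2):
    """Return all contiguous word n-grams of length 1..max_n in text."""
    words = text.split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]

def ngram_stats(pairs, max_n=2):
    """Aggregate (query, impression) pairs into per-n-gram sum, count and mean."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for query, impression in pairs:
        # Each occurrence of an n-gram contributes the full impression of its query.
        for gram in word_ngrams(query, max_n):
            totals[gram] += impression
            counts[gram] += 1
    rows = [
        {
            "ngram": len(gram.split()),
            "query": gram,
            "impression": totals[gram],
            "count": counts[gram],
            "mean": totals[gram] / counts[gram],
        }
        for gram in totals
    ]
    # Same ordering as in the question: n-gram length, then impressions descending.
    rows.sort(key=lambda r: (r["ngram"], -r["impression"]))
    return rows

# Hypothetical sample data, chosen so that 'nutrition' totals 100 over 4 queries.
sample = [
    ("nutrition facts", 40),
    ("nutrition label", 30),
    ("sports nutrition", 20),
    ("nutrition", 10),
]
for row in ngram_stats(sample):
    print(row)
```

With this made-up sample, 'nutrition' gets impression 100, count 4 and mean 25.0, matching the 100/4 = 25 example in the question. If a DataFrame is needed afterwards, passing the returned list of dictionaries to `pd.DataFrame` should produce one row per n-gram.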