2017-02-24 52 views

I have a chat dataset, and I want to group it into conversations and count the number of messages sent in each one.

Here is my data. It is the chat log of the user "ID", whose name is Jimmy.

Sender  Receiver Text 
ID   person1 HI 
person1  ID   Hello~ 
ID   person1 My name is Jimmy 
person1  ID   Nice to meet you! 
ID   person1 Nice to meet you, too 
ID   person2 Hi 
person1  ID   Hi there 
ID   person2 My name is Jimmy 
person1  ID   My name is Abi 
ID   person2 Nice to meet you 
...   ....  ..... 

"ID" can chat with several people.
I want to count the number of messages in each conversation.
In this case, both conversations have 5 messages.

I have already written some code, but because my data is large, it seems very inefficient.

# chat_df is the DataFrame of chat data
df = []
total_message = []
receiver_id = chat_df["Receiver"].unique()
for x in receiver_id:
    # rows where x is either the receiver or the sender
    mask = (chat_df["Receiver"] == x) | (chat_df["Sender"] == x)
    total_message.append(len(chat_df[mask]))
    df.append(chat_df[mask])

Is there a more efficient way to get the chat data for each pair of people?

Answer

1

I think you need stack and value_counts:

df1 = chat_df[['Sender','Receiver']].stack().value_counts().reset_index() 
df1.columns = ['People','Counts'] 
print (df1) 
    People Counts 
0  ID  10 
1 person1  7 
2 person2  3 
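
If the goal is a count per conversation rather than per person, one possible sketch is to build an order-independent key for each sender/receiver pair and group on it. This is only an illustration under assumptions: the sample frame retypes the rows shown in the question, and the names `keys`, `a`, `b`, `conv_counts` and `conversations` are made up here and do not come from the original post.

```python
import numpy as np
import pandas as pd

# Sample rows retyped from the question (assumed to be representative).
rows = [
    ("ID", "person1", "HI"),
    ("person1", "ID", "Hello~"),
    ("ID", "person1", "My name is Jimmy"),
    ("person1", "ID", "Nice to meet you!"),
    ("ID", "person1", "Nice to meet you, too"),
    ("ID", "person2", "Hi"),
    ("person1", "ID", "Hi there"),
    ("ID", "person2", "My name is Jimmy"),
    ("person1", "ID", "My name is Abi"),
    ("ID", "person2", "Nice to meet you"),
]
chat_df = pd.DataFrame(rows, columns=["Sender", "Receiver", "Text"])

# Sort each (Sender, Receiver) pair so that A->B and B->A share one key.
pairs = np.sort(chat_df[["Sender", "Receiver"]].to_numpy(dtype=str), axis=1)
keys = pd.DataFrame(pairs, columns=["a", "b"], index=chat_df.index)

# One row per conversation with its message count.
conv_counts = keys.groupby(["a", "b"]).size().reset_index(name="Counts")
print(conv_counts)

# The per-conversation sub-frames, built in a single groupby pass
# instead of filtering the whole frame once per user.
conversations = {pair: g for pair, g in chat_df.groupby([keys["a"], keys["b"]])}
```

With the rows as shown in the question, this groups the messages into an ("ID", "person1") conversation and an ("ID", "person2") conversation. Whether a conversation should be keyed on the unordered pair is an assumption; a dataset with group chats would need a different key.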

Edit:

#get the number of words in each message
chat_df['Len'] = chat_df.Text.str.split().str.len() 
#reshape dataframe 
chat_df = chat_df.set_index('Len')[['Sender','Receiver']].stack().reset_index(name='People') 
print (chat_df) 
    Len level_1 People 
0  1 Sender  ID 
1  1 Receiver person1 
2  1 Sender person1 
3  1 Receiver  ID 
4  4 Sender  ID 
5  4 Receiver person1 
6  4 Sender person1 
7  4 Receiver  ID 
8  5 Sender  ID 
9  5 Receiver person1 
10 1 Sender  ID 
11 1 Receiver person2 
12 2 Sender person1 
13 2 Receiver  ID 
14 4 Sender  ID 
15 4 Receiver person2 
16 4 Sender person1 
17 4 Receiver  ID 
18 4 Sender  ID 
19 4 Receiver person2 

#groupby by People and aggregate sum and size 
chat_df1 = chat_df.groupby('People')['Len'].agg(['size','sum']) 
chat_df1.columns = ['Count','Len_words'] 
chat_df1 = chat_df1.reset_index() 
#keep only people with more than 5 messages
chat_df1 = chat_df1[chat_df1['Count'] > 5] 
print (chat_df1) 
    People Count Len_words 
0  ID  10   30 
1 person1  7   21 
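
For reference, the same reshape can be sketched without overwriting chat_df, using melt and named aggregation. This is an alternative formulation and not the code from the answer above; it assumes a pandas version that supports named aggregation on a grouped Series (0.25 or later), retypes the sample rows from the question, and the names `words`, `long_df`, `stats` and `busy` are invented for the example.

```python
import pandas as pd

# Sample rows retyped from the question (assumed to be representative).
rows = [
    ("ID", "person1", "HI"),
    ("person1", "ID", "Hello~"),
    ("ID", "person1", "My name is Jimmy"),
    ("person1", "ID", "Nice to meet you!"),
    ("ID", "person1", "Nice to meet you, too"),
    ("ID", "person2", "Hi"),
    ("person1", "ID", "Hi there"),
    ("ID", "person2", "My name is Jimmy"),
    ("person1", "ID", "My name is Abi"),
    ("ID", "person2", "Nice to meet you"),
]
chat_df = pd.DataFrame(rows, columns=["Sender", "Receiver", "Text"])

# Words per message, split on whitespace.
words = chat_df["Text"].str.split().str.len()

# One row per (message, participant): each message is counted once for
# its sender and once for its receiver.
long_df = chat_df.assign(Len=words).melt(
    id_vars=["Len"], value_vars=["Sender", "Receiver"], value_name="People"
)

# Message count and total word count per person.
stats = (
    long_df.groupby("People")["Len"]
    .agg(Count="size", Len_words="sum")
    .reset_index()
)

# Boolean indexing: keep only people involved in more than 5 messages.
busy = stats[stats["Count"] > 5]
print(busy)
```

The original chat_df stays intact here, so the per-message text is still available for further filtering.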

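Thanks! That's exactly what I need! One more question: if I want to count the amount of text in each message, for the people with higher counts (more than 5), how would you suggest doing it? Thank you very much! – jimmy15923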


Thanks. I've been thinking about your second question, and I don't think there is a better solution than [boolean indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing). – jezrael


What do you mean by the amount of text? A count? Or the length of the text messages? – jezrael