2017-04-23 67 views

Pandas rolling sum on string column

I am using Python 3 with pandas version '0.19.2'.

I have a pandas DataFrame as follows:

chat_id line 
1   'Hi.' 
1   'Hi, how are you?.' 
1   'I'm well, thanks.' 
2   'Is it going to rain?.' 
2   'No, I don't think so.' 

I want to group by 'chat_id' and then do something like a rolling sum on 'line' to get the following:

chat_id line      conversation 
1   'Hi.'     'Hi.' 
1   'Hi, how are you?.'  'Hi. Hi, how are you?.' 
1   'I'm well, thanks.'  'Hi. Hi, how are you?. I'm well, thanks.' 
2   'Is it going to rain?.' 'Is it going to rain?.' 
2   'No, I don't think so.' 'Is it going to rain?. No, I don't think so.' 

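I believe df.groupby('chat_id')['line'].cumsum() only works on numeric columns.

I also tried df.groupby(by=['chat_id'], as_index=False)['line'].apply(list) to get a list of all the lines in each complete conversation, but then I could not figure out how to unpack that list to create the 'rolling sum' style conversation column.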


Interesting. It works if you call 'cumsum' on a Series, but raises an error when called on a groupby object. – ayhan
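The Series-level behaviour mentioned in this comment can be sketched as follows. This is only an illustration: the sample strings are taken from the question, and the explicit `dtype=object` is an assumption made so the strings are stored as plain Python objects.

```python
import pandas as pd

# Lines of a single chat, stored as plain Python strings (object dtype).
lines = pd.Series(
    ["Hi.", "Hi, how are you?.", "I'm well, thanks."],
    dtype=object,
)

# Append a separator to each line, concatenate cumulatively,
# then strip the trailing space from every running total.
running = (lines + " ").cumsum().str.strip()
print(running.tolist())
```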

Answer


For me, apply with Series.cumsum works; add a space as the separator if needed:

df['new'] = df.groupby('chat_id')['line'].apply(lambda x: (x + ' ').cumsum().str.strip()) 
print (df) 
    chat_id     line           new 
0  1     Hi.           Hi. 
1  1  Hi, how are you?.      Hi. Hi, how are you?. 
2  1  I'm well, thanks.  Hi. Hi, how are you?. I'm well, thanks. 
3  2 Is it going to rain?.      Is it going to rain?. 
4  2 No, I don't think so. Is it going to rain?. No, I don't think so. 

df['line'] = df['line'].str.strip("'") 
df['new'] = df.groupby('chat_id')['line'].apply(lambda x: "'" + (x + ' ').cumsum().str.strip() + "'") 
print (df) 
    chat_id     line \ 
0  1     Hi. 
1  1  Hi, how are you?. 
2  1  I'm well, thanks. 
3  2 Is it going to rain?. 
4  2 No, I don't think so. 

              new 
0           'Hi.' 
1      'Hi. Hi, how are you?.' 
2  'Hi. Hi, how are you?. I'm well, thanks.' 
3      'Is it going to rain?.' 
4 'Is it going to rain?. No, I don't think so.' 
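
A variant that does not rely on cumsum for strings is sketched below. It is not the method from the answer above: it swaps in `itertools.accumulate` together with `groupby(...).transform`, and the helper name `running_join` and the column name `conversation` are hypothetical. One reason to consider it is that the index of a `groupby(...).apply` result may differ between pandas versions, whereas `transform` returns a result aligned with the original rows.

```python
from itertools import accumulate

import pandas as pd

# Sample data mirroring the question (surrounding quotes omitted).
df = pd.DataFrame({
    "chat_id": [1, 1, 1, 2, 2],
    "line": [
        "Hi.",
        "Hi, how are you?.",
        "I'm well, thanks.",
        "Is it going to rain?.",
        "No, I don't think so.",
    ],
})


def running_join(lines):
    """Return the space-joined running concatenation of one chat's lines.

    Hypothetical helper, not part of pandas.
    """
    joined = accumulate(lines, lambda so_far, line: so_far + " " + line)
    return pd.Series(list(joined), index=lines.index)


# transform applies the helper per chat_id and keeps the original row
# order and index, so the result can be assigned back as a new column.
df["conversation"] = df.groupby("chat_id")["line"].transform(running_join)
print(df)
```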

For me, this results in: ValueError: cannot reindex from a duplicate axis – user3591836


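What is your pandas version? 'print(pd.show_versions())'. I cannot reproduce your error. I tested with duplicate values and with a duplicate index, and everything works perfectly in version '0.19.2'. – jezrael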


Sorry, you are right. I had to call reset_index() on the df first, and then it worked. – user3591836
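
The fix described in this last comment might look roughly like the sketch below. The frame with repeated index labels is invented for illustration, `drop=True` is an assumption (the comment only mentions reset_index()), and the running concatenation uses `itertools.accumulate` with `transform` rather than the exact expression from the answer. Whether the ValueError appears at all may depend on the pandas version, so the sketch only shows the index being made unique before the grouped operation.

```python
from itertools import accumulate

import pandas as pd

# Hypothetical frame whose index labels repeat, e.g. after
# concatenating several frames without resetting the index.
df = pd.DataFrame(
    {
        "chat_id": [1, 1, 2, 2],
        "line": [
            "Hi.",
            "Hi, how are you?.",
            "Is it going to rain?.",
            "No, I don't think so.",
        ],
    },
    index=[0, 0, 1, 1],
)
print(df.index.is_unique)  # the labels 0 and 1 each appear twice

# Replace the index with a fresh 0..n-1 range; drop=True (an assumption)
# discards the old labels instead of keeping them as a column.
df = df.reset_index(drop=True)

# Running concatenation per chat, aligned with the now-unique index.
df["new"] = df.groupby("chat_id")["line"].transform(
    lambda lines: pd.Series(
        list(accumulate(lines, lambda so_far, line: so_far + " " + line)),
        index=lines.index,
    )
)
print(df)
```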