2017-04-10

Extract numbers from a sentence and calculate the average

I have a sheet with the following data:

team1,team2,outcome 
AA,BB,BB won by 90 runs 
AA,CC,AA won by 19 runs (D/L method) 
CC,BB,CC won by 26 runs (D/L method) 
AA,BB,BB won by 56 runs 
CC,BB,CC won by 18 runs 

I need to pick out the numeric values and calculate their average, grouped by team1 and team2.

This is what I have so far. There is a lot of junk data, so I am filtering for only the records that mention runs:

df[df['outcome'].str.contains('runs',na=False)].head() 

The result I want:

team1 , team2 , AVG(NUMERIC COLUMN FROM 'OUTCOME') 

Please advise!

Answers

1

You can use extract and cast to int first, then groupby and aggregate with mean:

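# Pull the first run of digits out of each outcome string and cast it to int
df['outcome'] = df['outcome'].str.extract(r'(\d+)', expand=False).astype(int)
# Average the extracted numbers for each (team1, team2) pair
print(df.groupby(['team1', 'team2'], as_index=False)['outcome'].mean())
  team1 team2  outcome
0    AA    BB       73
1    AA    CC       19
2    CC    BB       22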

A similar solution:

# Leave df untouched: extract into a separate Series and group it by the team columns
s = df['outcome'].str.extract(r'(\d+)', expand=False).astype(int)
print(s.groupby([df['team1'], df['team2']]).mean().reset_index())
  team1 team2  outcome
0    AA    BB       73
1    AA    CC       19
2    CC    BB       22
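
Since the question mentions junk data, here is a minimal self-contained sketch of the same extract, groupby and mean idea that tolerates rows without a run count. The two extra rows ("No result" and a missing value), the `runs` column name, and the stricter pattern `(\d+)\s+runs` are assumptions added for illustration; they are not part of the original data or answer. Rows that do not match become NaN and are dropped before averaging, which avoids the error that `astype(int)` raises on NaN.

```python
import pandas as pd

# Sample rows from the question, plus two hypothetical junk rows at the end
df = pd.DataFrame({
    "team1": ["AA", "AA", "CC", "AA", "CC", "AA", "BB"],
    "team2": ["BB", "CC", "BB", "BB", "BB", "CC", "CC"],
    "outcome": [
        "BB won by 90 runs",
        "AA won by 19 runs (D/L method)",
        "CC won by 26 runs (D/L method)",
        "BB won by 56 runs",
        "CC won by 18 runs",
        "No result",  # hypothetical junk row
        None,         # hypothetical missing value
    ],
})

# Capture the digits directly before "runs"; rows that do not match become NaN
runs = df["outcome"].str.extract(r"(\d+)\s+runs", expand=False)

# Convert the digit strings to numbers; NaN is kept, so the column is float
runs = pd.to_numeric(runs)

result = (
    df.assign(runs=runs)
      .dropna(subset=["runs"])                       # discard rows without a run count
      .groupby(["team1", "team2"], as_index=False)["runs"]
      .mean()
)
print(result)
```

With this approach the separate `str.contains('runs')` filter from the question is not strictly needed, because non-matching rows drop out on their own.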

Thanks, I will try this out. Could you tell me what expand=False is for? – ANI

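It is only there to avoid a warning: 'FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame) but in a future version of pandas this will be changed to expand=True (return DataFrame)' – jezrael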

Great, it works, thanks a lot :) – ANI
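
To illustrate the expand discussion in the comments, a small sketch of what the parameter changes for a pattern with a single capture group. The two-row Series is made up for the example. As far as I know, recent pandas releases default to expand=True, so passing expand=False explicitly is what keeps the result a Series that can be cast and grouped directly.

```python
import pandas as pd

# Two outcome strings in the same shape as the question's data
s = pd.Series(["BB won by 90 runs", "AA won by 19 runs (D/L method)"])

# expand=False with one capture group returns a Series
as_series = s.str.extract(r"(\d+)", expand=False)

# expand=True always returns a DataFrame, one column per capture group
as_frame = s.str.extract(r"(\d+)", expand=True)

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (2, 1)
```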