2017-06-13 31 views
0

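Replace outliers in a pandas DataFrame with NaN after applying .groupby()

I want to remove outliers from a pandas DataFrame, using the standard deviation of each column, after applying the groupby function.

This is my DataFrame: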

  ARI  Flesch Kincaid    Speaker  Score 
0  -2.090000 121.220000 -3.400000     NaN  NaN 
1  8.276460 64.478573 9.034156  William Dudley 1.670275 
2  19.570911 27.362067 17.253580  Janet Yellen -0.604757 
3  -2.090000 121.220000 -3.400000     NaN  NaN 
4  -2.090000 121.220000 -3.400000     NaN  NaN 
5  20.643483 17.069411 18.394178  Lael Brainard 0.215396 
6  -2.090000 121.220000 -3.400000     NaN  NaN 
7  -2.090000 121.220000 -3.400000     NaN  NaN 
8  12.624198 52.220468 11.403157 Jerome H. Powell -1.350798 
9  18.466305 35.186261 16.205693  Stanley Fischer 0.522121 
10 -2.090000 121.220000 -3.400000     NaN  NaN 
11 16.953460 36.246573 15.323457  Lael Brainard -0.217779 
12 -2.090000 121.220000 -3.400000     NaN  NaN 
13 -2.090000 121.220000 -3.400000     NaN  NaN 
14 17.066088 32.592551 16.108486  Stanley Fischer 0.642245 
15 -2.090000 121.220000 -3.400000     NaN  NaN 

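I want to first group the DataFrame by "Speaker" and then remove the outliers among the "ARI", "Flesch" and "Kincaid" values, where an outlier is a value more than 3 standard deviations away from the mean of that particular feature.

Please let me know whether this is possible. Thanks!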

+0

Could you post a snippet of your data instead of attaching an image? That makes it easier for people to reproduce. – titipata

+0

Is that better? Thanks! –

+0

Perfect, thanks Graham. Someone will solve it soon :) – titipata

Answers

1

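The only dependency this approach needs is pandas.

Assume we have replaced the NaN values in the "Speaker" column with something representative, such as "CommitteeOrganization":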

speaker = dataset['Speaker'].fillna(value='CommitteeOrganization')
dataset['Speaker'] = speaker

Our data then looks like this:

Index ARI Flesch Kincaid Speaker Score 
0 -2.090000 121.220000 -3.400000 CommitteeOrganization NaN 
1 8.276460 64.478573 9.034156 WilliamDudley 1.670275 
2 19.570911 27.362067 17.253580 JanetYellen -0.604757 
3 -2.090000 121.220000 -3.400000 CommitteeOrganization NaN 
4 -2.090000 121.220000 -3.400000 CommitteeOrganization NaN 
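
To make the steps reproducible, here is a minimal sketch that rebuilds the sample from the question and applies the same fillna step. The rows are copied from the question's table; the `placeholder` tuple is only a shorthand introduced here for the repeated rows without a speaker.

```python
import numpy as np
import pandas as pd

# Rows copied from the question; rows without a speaker have NaN in Speaker and Score.
placeholder = (-2.09, 121.22, -3.4, np.nan, np.nan)
rows = [
    placeholder,
    (8.276460, 64.478573, 9.034156, "William Dudley", 1.670275),
    (19.570911, 27.362067, 17.253580, "Janet Yellen", -0.604757),
    placeholder,
    placeholder,
    (20.643483, 17.069411, 18.394178, "Lael Brainard", 0.215396),
    placeholder,
    placeholder,
    (12.624198, 52.220468, 11.403157, "Jerome H. Powell", -1.350798),
    (18.466305, 35.186261, 16.205693, "Stanley Fischer", 0.522121),
    placeholder,
    (16.953460, 36.246573, 15.323457, "Lael Brainard", -0.217779),
    placeholder,
    placeholder,
    (17.066088, 32.592551, 16.108486, "Stanley Fischer", 0.642245),
    placeholder,
]
dataset = pd.DataFrame(rows, columns=["ARI", "Flesch", "Kincaid", "Speaker", "Score"])

# Give the rows without a speaker an explicit label so groupby keeps them.
dataset["Speaker"] = dataset["Speaker"].fillna("CommitteeOrganization")
print(dataset.head())
```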

Group by "Speaker" with the pandas groupby function and take the mean of each group:

datasetGrouped = dataset.groupby(by='Speaker').mean()

The grouped data then looks like this:

Speaker    ARI Flesch Kincaid Score 
CommitteeOrganization -2.090000 121.220000 -3.400000 NaN 
JanetYellen 19.570911 27.362067 17.253580 -0.604757 
JeromeH.Powell 12.624198 52.220468 11.403157 -1.350798 
LaelBrainard 18.798471 26.657992 16.858818 -0.001191 
StanleyFischer 17.766196 33.889406 16.157089 0.582183 
WilliamDudley 8.276460 64.478573 9.034156 1.670275 

Compute the standard deviation of each column:

aristd = datasetGrouped['ARI'].std() 
fleschstd = datasetGrouped['Flesch'].std() 
kincaidstd = datasetGrouped['Kincaid'].std() 

Replace the values in the rows that satisfy the condition with NaN:

# Assign a real float NaN rather than the string 'NaN', so the columns stay numeric
datasetGrouped.loc[datasetGrouped.ARI.abs() > aristd*3, 'ARI'] = float('nan')
datasetGrouped.loc[datasetGrouped.Flesch.abs() > fleschstd*3, 'Flesch'] = float('nan')
datasetGrouped.loc[datasetGrouped.Kincaid.abs() > kincaidstd*3, 'Kincaid'] = float('nan')

The final dataset:

Speaker    ARI Flesch Kincaid Score 
CommitteeOrganization -2.090000 NaN -3.400000 NaN 
JanetYellen 19.570911 27.3621 17.253580 -0.604757 
JeromeH.Powell 12.624198 52.2205 11.403157 -1.350798 
LaelBrainard 18.798471 26.658 16.858818 -0.001191 
StanleyFischer 17.766196 33.8894 16.157089 0.582183 
WilliamDudley 8.276460 64.4786 9.034156 1.670275 
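
The three replacement steps can also be written as one loop. The sketch below starts from the per-speaker means printed in the grouped table above and applies the same rule as this answer (absolute value greater than 3 times the column's standard deviation). `mask_large_values` is a hypothetical helper name, not part of pandas.

```python
import numpy as np
import pandas as pd

# Per-speaker means as printed in the answer's grouped table.
grouped = pd.DataFrame(
    {
        "ARI": [-2.090000, 19.570911, 12.624198, 18.798471, 17.766196, 8.276460],
        "Flesch": [121.220000, 27.362067, 52.220468, 26.657992, 33.889406, 64.478573],
        "Kincaid": [-3.400000, 17.253580, 11.403157, 16.858818, 16.157089, 9.034156],
    },
    index=pd.Index(
        ["CommitteeOrganization", "JanetYellen", "JeromeH.Powell",
         "LaelBrainard", "StanleyFischer", "WilliamDudley"],
        name="Speaker",
    ),
)


def mask_large_values(frame, columns, n_std=3):
    """Hypothetical helper: same rule as the answer, |value| > n_std * column std."""
    result = frame.copy()
    for col in columns:
        threshold = n_std * result[col].std()
        # A float NaN keeps the column numeric, unlike the string 'NaN'.
        result.loc[result[col].abs() > threshold, col] = np.nan
    return result


cleaned = mask_large_values(grouped, ["ARI", "Flesch", "Kincaid"])
print(cleaned)
```

As in the final dataset above, only the Flesch value of CommitteeOrganization is replaced. Note that this rule compares the raw value with the threshold, not its distance from the column mean.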

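The complete code is available on GitHub.

Note: this could be done with less code than shown here, but the answer goes step by step to make it easier to follow.

Note 2: the question is a bit ambiguous, so if I have misunderstood something and this is not the right answer, don't hesitate to tell me and I will update the answer if I can.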
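
If the question is read the other way, with each row compared against the mean and standard deviation of its own speaker, one possible sketch uses groupby(...).transform so that the group statistics line up with the original rows. `mask_group_outliers` and the `demo` data are assumptions made for illustration; they are not from the question or from pandas itself. It also assumes missing speakers have already been filled in, as done earlier in this answer.

```python
import numpy as np
import pandas as pd


def mask_group_outliers(frame, group_col, value_cols, n_std=3.0):
    """Hypothetical helper: set to NaN every value lying more than n_std group
    standard deviations away from the mean of its own group."""
    result = frame.copy()
    grouped = result.groupby(group_col)[value_cols]
    # transform() returns frames aligned with the original rows.
    group_mean = grouped.transform("mean")
    group_std = grouped.transform("std")
    deviation = (result[value_cols] - group_mean).abs()
    # Single-row groups have a NaN std, so the comparison is False for them.
    is_outlier = deviation > n_std * group_std
    result[value_cols] = result[value_cols].mask(is_outlier)
    return result


# Made-up data: speaker "A" has 20 rows with one extreme ARI value,
# speaker "B" has only two rows.
demo = pd.DataFrame({
    "Speaker": ["A"] * 20 + ["B"] * 2,
    "ARI": [10.0] * 19 + [100.0] + [1.0, 50.0],
})
cleaned = mask_group_outliers(demo, "Speaker", ["ARI"])
print(cleaned[cleaned["ARI"].isna()])
```

With the sample standard deviation, a value can only be more than 3 standard deviations from its group mean when the group has at least 11 rows, so on the 16-row sample from the question this rule would not replace anything.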

+0

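Thanks! One question I have is whether the standard deviation is calculated across all "Speaker" types. Since an individual speaker has multiple entries in the DataFrame, I would like to calculate the standard deviation and mean of ARI, Flesch and Kincaid for each speaker, and then replace the outliers based on that particular speaker's standard deviation. Does that make sense? Thanks again! –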

+0

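Individual speakers do have multiple entries; the method used is mean: 'datasetGrouped = dataset.groupby(by='Speaker').mean()'. So the ARI, Flesch and Kincaid values in the dataset grouped by Speaker are the mean of each speaker's individual values. – Alber8295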

+0

Great, thanks - I understand what is going on now! –