Transforming outliers using .apply, .applymap, .groupby

I am trying to transform a pandas DataFrame into a new object that classifies each point based on some simple thresholds:

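The value becomes 0 if the point is NaN
The value becomes 1 if the point is negative or 0
The value becomes 2 if it falls outside certain criteria based on the entire column
The value is 3 otherwise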

Here is a very simple self-contained example:

import pandas as pd 
import numpy as np 

df=pd.DataFrame({'a':[np.nan,1000000,3,4,5,0,-7,9,10],'b':[2,3,-4,5,6,1000000,7,9,np.nan]}) 

print(df)


The transformation process I have created so far:

#Loop through and find points greater than the mean -- in this simple example, these are the 'outliers' 
outliers = pd.DataFrame() 
for datapoint in df.columns: 
    tempser = pd.DataFrame(df[datapoint][np.abs(df[datapoint]) > (df[datapoint].mean())]) 
    outliers = pd.merge(outliers, tempser, right_index=True, left_index=True, how='outer') 

outliers[outliers.isnull() == False] = 2 


#Classify everything else as "3" 
df[df > 0] = 3 

#Classify negative and zero points as a "1" 
df[df <= 0] = 1 

#Update with the outliers 
df.update(outliers) 

#Everything else is a "0" 
df.fillna(value=0, inplace=True)

which results in:

[image: the classified DataFrame]

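I have tried using .applymap() and/or .groupby() to speed up the process, with no luck. I found some guidance in this answer; however, I am still unsure how .groupby() is useful when you are not grouping within a pandas column.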


2015-06-23 cmiller8

Here is a replacement for the outliers part. On the sample data it is about 5x faster on my computer.

>>> pd.DataFrame(np.where(np.abs(df) > df.mean(), 2, df), columns=df.columns) 

     a    b
0  NaN    2
1    2    3
2    3   -4
3    4    5
4    5    6
5    0    2
6   -7    7
7    9    9
8   10  NaN

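You can also do this with apply. It is much simpler, but it will be slower than the np.where approach (roughly the same speed as what you are currently doing). This is probably a good example of why you should avoid apply whenever possible if you care about speed.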

>>> df[ df.apply(lambda x: abs(x) > x.mean()) ] = 2

You can also do this, which is faster than apply but slower than np.where:

>>> mask = np.abs(df) > df.mean() 
>>> df[mask] = 2
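The same mask idea can be extended to all four classes in one pass. The following is only a sketch, not part of the original answer: it assumes np.select is acceptable for this and that the priority order is NaN first, then outlier, then non-positive, which matches what the question's step-by-step code produces on the sample data. The names is_outlier and classified are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1000000, 3, 4, 5, 0, -7, 9, 10],
                   'b': [2, 3, -4, 5, 6, 1000000, 7, 9, np.nan]})

# Per-column outlier mask, using the same criterion as above
is_outlier = (np.abs(df) > df.mean()).to_numpy()

# np.select picks the first condition that is True for each cell,
# so the order of the list encodes the priority (assumed here)
conditions = [df.isna().to_numpy(), is_outlier, (df <= 0).to_numpy()]
choices = [0, 2, 1]

classified = pd.DataFrame(np.select(conditions, choices, default=3),
                          index=df.index, columns=df.columns)
print(classified)
```

Because everything is computed from the original df, none of the steps can overwrite a value that a later step still needs to inspect.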

Of course, these things do not always scale linearly, so test them on your real data and see how they compare.
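One way to run that comparison is with the standard-library timeit module. This is a rough sketch on synthetic data; the frame size, the random data, and the helper names with_where, with_mask and with_apply are all assumptions for illustration, and the timings will differ on your machine and data.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (size chosen arbitrarily)
rng = np.random.default_rng(0)
big = pd.DataFrame(rng.normal(size=(2000, 50)))

def with_where(frame):
    # np.where on the whole frame, rebuilt into a DataFrame
    values = np.where(np.abs(frame) > frame.mean(), 2, frame)
    return pd.DataFrame(values, index=frame.index, columns=frame.columns)

def with_mask(frame):
    # Boolean-mask assignment on a copy
    out = frame.copy()
    out[np.abs(out) > out.mean()] = 2
    return out

def with_apply(frame):
    # Column-by-column apply on a copy
    out = frame.copy()
    out[out.apply(lambda col: abs(col) > col.mean())] = 2
    return out

for fn in (with_where, with_mask, with_apply):
    seconds = timeit.timeit(lambda: fn(big), number=5)
    print(f"{fn.__name__}: {seconds / 5:.4f} s per call")
```

Each helper works on a copy, so the three variants are timed on identical input.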


2015-06-23 15:36:27 JohnE

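For the outliers, I only want values replaced with '2' when they meet the condition within their own column, **not across the whole DataFrame** - I think your solution uses the whole DataFrame? – cmiller8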

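@cmiller8 No, it is per column. Type 'df.mean()' and you will see that it gives you the mean of each column. You can also try some different sample data to test it. – JohnE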

You are right! And your method is 300x faster on a DataFrame with 10K columns and 25K rows – cmiller8
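The per-column behaviour discussed in the comments above can be checked directly on the sample data. A small sketch (the variable names col_means and mask are just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1000000, 3, 4, 5, 0, -7, 9, 10],
                   'b': [2, 3, -4, 5, 6, 1000000, 7, 9, np.nan]})

# df.mean() returns a Series with one entry per column (NaN is skipped by default)
col_means = df.mean()
print(col_means)

# Comparing a DataFrame with that Series aligns on column labels,
# so each column is tested against its own mean
mask = np.abs(df) > col_means
print(mask.sum())  # number of flagged points per column
```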
