2017-01-10 42 views
2

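Replace low-frequency categorical values from pandas.dataframe while ignoring NaN

How can I replace values in certain columns of a pandas.DataFrame that occur rarely, i.e. with low frequency (ignoring NaN)?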

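For example, in the DataFrame below, suppose I want to replace any value in column A or B that occurs fewer than three times in its own column. I want to replace these with "other":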

import pandas as pd 
import numpy as np 

# np.nan is used directly; the pd.np alias was removed in pandas 2.0
df = pd.DataFrame({'A':['ant','ant','cherry', np.nan, 'ant'], 'B':['cat','peach', 'cat', 'cat', 'peach'], 'C':['dog','dog',np.nan, 'emu', 'emu']}) 
df 
A | B | C | 
---------------------- 
ant | cat | dog | 
ant | peach | dog | 
cherry | cat | NaN | 
NaN | cat | emu | 
ant | peach | emu | 

In other words, in columns A and B, I want to replace the values that occur twice or fewer (but leave NaN alone).

So the output I want is:

A | B | C | 
---------------------- 
ant | cat | dog | 
ant | other | dog | 
other | cat | NaN | 
NaN | cat | emu | 
ant | other | emu | 

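This is related to a previously posted question: Remove low frequency values from pandas.dataframe

But the solution there resulted in "AttributeError: 'NoneType' object has no attribute 'any'" (I think because I have NaN values?)

Answers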

2

This is very similar to Change values in pandas dataframe according to value_counts(). You can add a condition to the lambda function to exclude column 'C', as follows:

df.apply(lambda x: x.mask(x.map(x.value_counts())<3, 'other') if x.name!='C' else x) 
Out: 
       A      B    C
0    ant    cat  dog
1    ant  other  dog
2  other    cat  NaN
3    NaN    cat  emu
4    ant  other  emu

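This basically iterates over the columns. For each column, it generates the value counts and uses that Series for mapping. This lets x.mask check whether the count is less than 3. If it is, it returns 'other'; if not, it keeps the actual value. Finally, a condition checks the column name.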
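To make the mechanics concrete, here is a minimal sketch that breaks the one-liner into steps for a single column. The intermediate names (counts, per_row, is_rare) are introduced here for illustration only and are not part of the answer's code.

```python
import numpy as np
import pandas as pd

s = pd.Series(['ant', 'ant', 'cherry', np.nan, 'ant'], name='A')

counts = s.value_counts()          # NaN is excluded by default: ant -> 3, cherry -> 1
per_row = s.map(counts)            # each row gets the count of its own value; NaN stays NaN
is_rare = per_row < 3              # NaN < 3 evaluates to False, so NaN rows are never flagged
result = s.mask(is_rare, 'other')  # replace only the flagged rows

print(result.tolist())
```

Because the comparison with NaN is False, the missing value in row 3 passes through untouched, which is why no explicit NaN handling is needed.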

This can be generalized to multiple columns by changing the lambda's condition from x.name!='C' to x.name not in 'CDEF' or x.name not in ['C', 'D', 'E', 'F'].
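A small sketch of the list-based form, assuming a hypothetical `skip` list of column names to leave untouched. Note that `x.name not in 'CDEF'` is a substring test on a string, so the list form is the safer choice once column names are longer than one character.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', np.nan, 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', np.nan, 'emu', 'emu']})

skip = ['C']  # hypothetical list of columns to leave as they are

# Columns named in `skip` are returned unchanged; all others get rare values replaced.
out = df.apply(
    lambda x: x if x.name in skip else x.mask(x.map(x.value_counts()) < 3, 'other')
)
print(out)
```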

1

You can use:

# added one last row for a more complicated df 
# np.nan is used directly; the pd.np alias was removed in pandas 2.0
df = pd.DataFrame({'A':['ant','ant','cherry', np.nan, 'ant', 'd'], 
                   'B':['cat','peach', 'cat', 'cat', 'peach', 'm'], 
                   'C':['dog','dog',np.nan, 'emu', 'emu', 'k']}) 
print (df) 
        A      B    C
0     ant    cat  dog
1     ant  peach  dog
2  cherry    cat  NaN
3     NaN    cat  emu
4     ant  peach  emu
5       d      m    k

Use value_counts with boolean indexing to find all the values to replace:

a = df.A.value_counts() 
a = a[a < 3].index 
print (a) 
Index(['cherry', 'd'], dtype='object') 

b = df.B.value_counts() 
b = b[b < 3].index 
print (b) 
Index(['peach', 'm'], dtype='object') 

Then use replace with a dict comprehension, which handles the case of several values to replace:

df.A = df.A.replace({x:'other' for x in a}) 
df.B = df.B.replace({x:'other' for x in b}) 
print (df) 
       A      B    C
0    ant    cat  dog
1    ant  other  dog
2  other    cat  NaN
3    NaN    cat  emu
4    ant  other  emu
5  other  other    k

All together in a loop:

cols = ['A','B'] 
for col in cols: 
    val = df[col].value_counts() 
    y = val[val < 3].index 
    df[col] = df[col].replace({x:'other' for x in y}) 
print (df) 
       A      B    C
0    ant    cat  dog
1    ant  other  dog
2  other    cat  NaN
3    NaN    cat  emu
4    ant  other  emu
5  other  other    k
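
As a possible alternative to the dict comprehension, the same loop can be sketched with isin and mask. This is a swapped-in technique, not the answer's own code. It never builds a dict, so a column in which nothing is rare (an empty `rare` index) needs no special handling.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', np.nan, 'ant', 'd'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach', 'm'],
                   'C': ['dog', 'dog', np.nan, 'emu', 'emu', 'k']})

for col in ['A', 'B']:
    counts = df[col].value_counts()      # NaN is not counted
    rare = counts[counts < 3].index      # labels seen fewer than 3 times; may be empty
    # isin() is False for NaN here, so missing values are left alone
    df[col] = df[col].mask(df[col].isin(rare), 'other')

print(df)
```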
+0

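Hmm, so this works on this sample df, but when I try to do it with my actual data, I get an error on the line with replace and the dict comprehension: ValueError: not enough values to unpack (expected 2, got 0). Any idea what might be happening there? – Imu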

+0

I'm not sure, maybe it is necessary to convert to a list - `df[col] = df[col].replace({x:'other' for x in y.tolist()})` – jezrael

2

Using a helper function and replace

def replace_low_freq(df, threshold=2, replacement='other'): 
    s = df.stack() 
    c = s.value_counts() 
    m = pd.Series(replacement, c.index[c <= threshold]) 
    return s.replace(m).unstack() 

cols = list('AB') 
# axis must be given by keyword; the positional form drop(cols, 1) was removed in pandas 2.0
replace_low_freq(df[cols]).join(df.drop(columns=cols)) 

       A      B    C
0    ant    cat  dog
1    ant  other  dog
2  other    cat  NaN
3   None    cat  emu
4    ant  other  emu
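
One thing to keep in mind: stack() puts the selected columns into a single Series, so value_counts here counts a value across A and B together rather than within each column. In the sample data the two columns share no values, so the result is the same. If counts should be computed per column, as the question describes, a variant might look like the sketch below. The function name replace_low_freq_per_column and the small demo frame are hypothetical and used only for illustration.

```python
import pandas as pd

def replace_low_freq_per_column(df, threshold=2, replacement='other'):
    # Hypothetical variant: frequencies are computed inside each column separately.
    def one_column(s):
        counts = s.map(s.value_counts())   # per-row count of the row's own value; NaN stays NaN
        return s.mask(counts <= threshold, replacement)
    return df.apply(one_column)

# 'y' occurs once in column A but three times in column B.
demo = pd.DataFrame({'A': ['x', 'x', 'y'], 'B': ['y', 'y', 'y']})
out = replace_low_freq_per_column(demo)
print(out)
```

With pooled counting, 'y' would be seen four times and kept in column A; counted per column it appears only once in A and is replaced there.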
+0

Nice clean solution +1 – ade1e