Replace low-frequency categorical values in a pandas.DataFrame while ignoring NaN

How can I replace values in certain columns of a pandas.DataFrame that occur rarely, i.e. with low frequency (ignoring NaN)?

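For example, in the data frame below, suppose I want to replace any value in column A or B that occurs fewer than three times in its respective column. I want to replace these with "other":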

import pandas as pd 
import numpy as np 

df = pd.DataFrame({'A':['ant','ant','cherry', np.nan, 'ant'], 'B':['cat','peach', 'cat', 'cat', 'peach'], 'C':['dog','dog',np.nan, 'emu', 'emu']}) 
df 
    A | B | C | 
---------------------- 
ant | cat | dog | 
ant | peach | dog | 
cherry | cat | NaN | 
NaN | cat | emu | 
ant | peach | emu |

In other words, in columns A and B, I want to replace the values that occur twice or less (but leave NaN alone).

So the output I want is:

A | B | C | 
---------------------- 
ant | cat | dog | 
ant | other | dog | 
other | cat | NaN | 
NaN | cat | emu | 
ant | other | emu |

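This is related to a previously posted question: Remove low frequency values from pandas.dataframe

but the solution there resulted in "AttributeError: 'NoneType' object has no attribute 'any'" (I think because I have NaN values?).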

2017-01-10 Imu

This is very similar to Change values in pandas dataframe according to value_counts(). You can add a condition to the lambda function to exclude column 'C', as follows:

df.apply(lambda x: x.mask(x.map(x.value_counts())<3, 'other') if x.name!='C' else x) 
Out: 
     A  B C 
0 ant cat dog 
1 ant other dog 
2 other cat NaN 
3 NaN cat emu 
4 ant other emu

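This essentially iterates over the columns. For each column, it generates the value counts and uses that Series for mapping. This lets x.mask check whether the count is smaller than 3. If so, it returns 'other'; if not, it keeps the actual value. Finally, a condition checks the column name.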
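To make the steps concrete, here is a minimal sketch that unrolls the lambda for a single column. The intermediate names (counts, per_row, is_rare) are illustrative and not part of the answer; the data is column A from the question.

```python
import numpy as np
import pandas as pd

# Column A from the question's example, including the NaN.
s = pd.Series(['ant', 'ant', 'cherry', np.nan, 'ant'], name='A')

counts = s.value_counts()      # NaN is excluded by default (dropna=True)
per_row = s.map(counts)        # each row's own frequency; the NaN row maps to NaN
is_rare = per_row < 3          # NaN < 3 evaluates to False, so NaN is never flagged
result = s.mask(is_rare, 'other')

print(is_rare.tolist())        # [False, False, True, False, False]
print(result)
```

Because the comparison against NaN is False, the missing value passes through mask unchanged, which is why this approach leaves NaN alone without any special handling.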

This can be generalized to multiple columns by changing the lambda's condition from x.name!='C' to x.name not in 'CDEF' or x.name not in ['C', 'D', 'E', 'F'].
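As a sketch of the multi-column case, the same logic can be wrapped in a function that takes an explicit list of columns to process. The function name replace_rare and its parameters are hypothetical, chosen for illustration. A list is used rather than a string because `in` on a string is a substring test, so a column named 'CD' would also match 'CDEF'.

```python
import numpy as np
import pandas as pd

def replace_rare(df, cols, threshold=3, replacement='other'):
    """Hypothetical helper: in each column listed in `cols`, replace values
    that occur fewer than `threshold` times. NaN and other columns are kept."""
    out = df.copy()
    for col in cols:
        freq = out[col].map(out[col].value_counts())  # NaN rows map to NaN
        out[col] = out[col].mask(freq < threshold, replacement)
    return out

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', np.nan, 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', np.nan, 'emu', 'emu']})

result = replace_rare(df, ['A', 'B'])
print(result)
```

Working on a copy keeps the original frame intact, which makes it easier to compare before and after.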

2017-01-10 20:33:48 ayhan

You can use:

# added one extra row to make the df more complicated
df = pd.DataFrame({'A':['ant','ant','cherry', np.nan, 'ant', 'd'], 
        'B':['cat','peach', 'cat', 'cat', 'peach', 'm'], 
        'C':['dog','dog',np.nan, 'emu', 'emu', 'k']}) 
print (df) 
     A  B C 
0  ant cat dog 
1  ant peach dog 
2 cherry cat NaN 
3  NaN cat emu 
4  ant peach emu 
5  d  m k

Use value_counts with boolean indexing to find all the values to replace:

a = df.A.value_counts() 
a = a[a < 3].index 
print (a) 
Index(['cherry', 'd'], dtype='object') 

b = df.B.value_counts() 
b = b[b < 3].index 
print (b) 
Index(['peach', 'm'], dtype='object')

Then replace them using a dict comprehension, which is handy if there are many values to replace:

df.A = df.A.replace({x:'other' for x in a}) 
df.B = df.B.replace({x:'other' for x in b}) 
print (df) 
     A  B C 
0 ant cat dog 
1 ant other dog 
2 other cat NaN 
3 NaN cat emu 
4 ant other emu 
5 other other k

All together in a loop:

cols = ['A','B'] 
for col in cols: 
    val = df[col].value_counts() 
    y = val[val < 3].index 
    df[col] = df[col].replace({x:'other' for x in y}) 
print (df) 
     A  B C 
0 ant cat dog 
1 ant other dog 
2 other cat NaN 
3 NaN cat emu 
4 ant other emu 
5 other other k

2017-01-10 20:13:52 jezrael

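Hmm, so this works on this sample df, but when I try to do it with my actual data, I get an error on the line that replaces with the dict comprehension: ValueError: not enough values to unpack (expected 2, got 0). Any idea what might be happening there? – Imu

I'm not sure, maybe it is necessary to convert to a list - `df[col] = df[col].replace({x:'other' for x in y.tolist()})` – jezrael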
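One possible explanation for that error, offered as a guess rather than a diagnosis: if a column has no value below the threshold, the dict comprehension produces an empty dict, and some older pandas versions appear to have failed on an empty replacement dict. The sketch below avoids the dict entirely by combining isin with mask, which handles an empty set of rare values without special cases. The data is made up for illustration; column B deliberately has no rare values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['ant', 'ant', 'cherry', np.nan, 'ant', 'd'],
    'B': ['cat', 'cat', 'cat', 'emu', 'emu', 'emu'],  # nothing occurs fewer than 3 times
})

for col in ['A', 'B']:
    counts = df[col].value_counts()       # NaN is not counted
    rare = counts[counts < 3].index       # may be empty, as for column B
    # isin() on an empty index is all False, so such a column is left unchanged
    df[col] = df[col].mask(df[col].isin(rare), 'other')

print(df)
```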

Using a helper function and replace

def replace_low_freq(df, threshold=2, replacement='other'): 
    s = df.stack() 
    c = s.value_counts() 
    m = pd.Series(replacement, c.index[c <= threshold]) 
    return s.replace(m).unstack() 

cols = list('AB') 
replace_low_freq(df[cols]).join(df.drop(columns=cols)) 

     A  B C 
0 ant cat dog 
1 ant other dog 
2 other cat NaN 
3 None cat emu 
4 ant other emu
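One thing to keep in mind with the stacking approach: stack pools the selected columns into one Series, so value_counts counts each value across all of them together rather than within each column. In the example above, columns A and B share no values, so both ways of counting agree. The toy frame below is hypothetical and only illustrates where they would differ.

```python
import pandas as pd

# Hypothetical data in which 'x' appears in both columns.
toy = pd.DataFrame({'A': ['x', 'x', 'y'], 'B': ['x', 'x', 'z']})

pooled = toy.stack().value_counts()     # counts over A and B together
per_column = toy['A'].value_counts()    # counts within column A only

print(pooled['x'])       # 4
print(per_column['x'])   # 2
```

With threshold=2, the pooled count keeps 'x' (4 occurrences), while a per-column count would replace it (2 occurrences in each column).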

2017-01-10 20:34:51 piRSquared

Nice clean solution +1 – ade1e
