Pandas: reduce the number of levels of a categorical variable

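I'm new to pandas and want to do something similar to "Reduce number of levels for large categorical variables" (binning a categorical variable in order to reduce its number of levels). The following code works fine in R: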

DTsetlvls <- function(x, newl) 
    setattr(x, "levels", c(setdiff(levels(x), newl), rep("other", length(newl))))

My DataFrame looks fine:

import pandas as pd

df = pd.DataFrame({'Color': 'Red Red Blue'.split(),
                   'Value': [100, 150, 50]})

# number of rows in each Color group, repeated on every row of that group
df['Counts'] = df.groupby('Color')['Value'].transform('count')
print(df)

  Color  Value  Counts
0   Red    100       2
1   Red    150       2
2  Blue     50       1

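I manually create an aggregate column and then, based on it, label the less frequent groups (e.g. "Blue") as a single "other" group. Compared to the concise R code this seems clumsy. What is the right approach here?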
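For reference, the manual approach described above could look roughly like the sketch below. The threshold `min_count` and the column name `Color_binned` are assumptions made for illustration; they do not come from the question.

```python
import pandas as pd

df = pd.DataFrame({'Color': 'Red Red Blue'.split(),
                   'Value': [100, 150, 50]})

# Per-row frequency of each Color, as in the question
df['Counts'] = df.groupby('Color')['Value'].transform('count')

# Assumed threshold: groups with fewer rows than this are lumped together
min_count = 2

# Series.mask replaces the values where the condition is True
df['Color_binned'] = df['Color'].mask(df['Counts'] < min_count, 'other')
print(df)
```

Here "Blue" occurs only once, so it is the only value that ends up in the "other" group.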

2016-08-23 Georg Heiler

Possible duplicate of [How to group "remaining" results beyond Top N into "Others" with pandas](http://stackoverflow.com/questions/19835746/how-to-group-remaining-results-beyond-top-n-into-others-with-pandas)

I think you can use value_counts together with numpy.where, with the condition built using isin:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Color': 'Red Red Blue Red Violet Blue'.split(),
                   'Value': [11, 150, 50, 30, 10, 40]})
print(df)
    Color  Value
0     Red     11
1     Red    150
2    Blue     50
3     Red     30
4  Violet     10
5    Blue     40

a = df.Color.value_counts()
print(a)
Red       3
Blue      2
Violet    1
Name: Color, dtype: int64

# value_counts sorts by frequency, so the first 2 index labels are the top 2 values
vals = a[:2].index
print(vals)
Index(['Red', 'Blue'], dtype='object')

df['new'] = np.where(df.Color.isin(vals), 0, 1)
print(df)
    Color  Value  new
0     Red     11    0
1     Red    150    0
2    Blue     50    0
3     Red     30    0
4  Violet     10    1
5    Blue     40    0

Or, if you need to replace all values that are not among the top ones, use where:

df['new1'] = df.Color.where(df.Color.isin(vals), 'other')
print(df)
    Color  Value   new1
0     Red     11    Red
1     Red    150    Red
2    Blue     50   Blue
3     Red     30    Red
4  Violet     10  other
5    Blue     40   Blue
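The steps above can be wrapped into a small reusable helper, which gets closer to the one-liner feel of the R version. This is only a sketch: the function name `lump_rare_levels` and its parameters `top_n` and `other_label` are hypothetical names introduced here, not part of pandas. Note that when several values have the same count, which of them lands in the top N depends on the order returned by value_counts.

```python
import pandas as pd

def lump_rare_levels(s, top_n, other_label='other'):
    """Keep the top_n most frequent values of s; replace all others with other_label.

    Hypothetical helper, not a pandas function.
    """
    # value_counts sorts by descending frequency, so the first labels are the top ones
    top = s.value_counts().index[:top_n]
    # where keeps values for which the condition holds and replaces the rest
    return s.where(s.isin(top), other_label)

colors = pd.Series('Red Red Blue Red Violet Blue'.split(), name='Color')
lumped = lump_rare_levels(colors, top_n=2)
print(lumped.tolist())

# Optionally store the result with the categorical dtype
lumped_cat = lumped.astype('category')
print(list(lumped_cat.cat.categories))
```

Converting to the category dtype at the end is optional; it mirrors the R factor more closely and saves memory when the column has many repeated values.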

2016-08-23 11:30:08 jezrael
