Pandas DataFrame apply function doubles the size of the DataFrame

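I have a pandas DataFrame with numeric data. For each non-binary column, I want to identify the values greater than its 99th percentile and create a boolean mask that I will later use to drop rows containing outliers.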

I tried to create this boolean mask with the apply method as shown below, where df is a DataFrame of numeric data of size a * b.

import numpy as np
import pandas as pd

def make_mask(s):
    if s.unique().shape[0] == 2:  # If binary, return all-False mask
        return pd.Series(np.zeros(s.shape[0]), dtype=bool)
    else:  # Otherwise, identify outliers
        return s >= np.percentile(s, 99)

s_bool = df.apply(make_mask, axis=1)

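Unfortunately, s_bool comes out as a DataFrame with twice as many columns (i.e., of size a * (b * 2)). The first set of columns is named 1, 2, 3, and so on, and is filled with null values. The second set of columns appears to be the expected mask.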

Why does the apply method double the size of the DataFrame? Unfortunately, the pandas apply documentation offers no useful clues.

Source

2015-04-28 Gyan Veda

Are you sure you posted the right code? `raw=True` means the function is passed an `ndarray`, and `ndarray` objects have no `unique` method. I tried it with `raw=False` and it works fine. – TheBlackCat

My bad, I shouldn't have specified the `raw` argument, so it is implicitly set to `False`. The doubled columns appear when I don't set this argument at all. –

I tried it with a fresh random DataFrame and couldn't reproduce the problem: `df = pd.DataFrame(np.random.random((50,20)), columns=tuple('abcdefghijklmnopqrstuvwxyz'[:20]), index=np.arange(0,5,.1))` – TheBlackCat

I'm not sure why, but the problem seems to be that you are returning a Series. This seems to work for the given example:

def make_mask(s):
    if s.unique().shape[0] == 2:  # If binary, return all-False mask
        return np.zeros(s.shape[0], dtype=bool)
    else:  # Otherwise, identify outliers
        return s >= np.percentile(s, 99)
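One possible explanation, offered as an assumption and not something confirmed in this thread: with axis=1 each row arrives as a Series indexed by the column names, while `pd.Series(np.zeros(...))` gets a fresh integer index. When pandas combines the per-row results it aligns them by label, so the two sets of labels end up side by side. The sketch below uses a made-up two-row frame and two hypothetical helper names (`mask_fresh_index`, `mask_same_index`) to compare the original return value with one that reuses `s.index`.

```python
import numpy as np
import pandas as pd

# Toy data (made up): row 0 has two distinct values ("binary"), row 1 has three.
df = pd.DataFrame({"a": [0.0, 1.0], "b": [1.0, 2.0], "c": [0.0, 3.0]})

def mask_fresh_index(s):
    # Same logic as the question: the all-False Series gets index 0, 1, 2
    if s.unique().shape[0] == 2:
        return pd.Series(np.zeros(s.shape[0]), dtype=bool)
    return s >= np.percentile(s, 99)  # indexed by a, b, c

def mask_same_index(s):
    # Hypothetical variant: reuse the incoming labels so every result lines up
    if s.unique().shape[0] == 2:
        return pd.Series(False, index=s.index)
    return s >= np.percentile(s, 99)

doubled = df.apply(mask_fresh_index, axis=1)  # labels 0, 1, 2 plus a, b, c
aligned = df.apply(mask_same_index, axis=1)   # labels a, b, c only

print(doubled.shape, aligned.shape)
```

If that reading is right, it would also account for the null-filled columns in the question: each row only has values under one of the two label sets.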

The code can be simplified further, like this, by using raw=True:

def make_mask(s):
    if np.unique(s).size == 2:  # If binary, return all-False mask
        return np.zeros_like(s, dtype=bool)
    else:  # Otherwise, identify outliers
        return s >= np.percentile(s, 99)
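For reference, here is a minimal sketch of how the raw=True version might be used end to end. The data and the column names `flag` and `x` are made up, and it assumes the mask is meant per column (axis=0), as the question text describes, even though the question's own call uses axis=1.

```python
import numpy as np
import pandas as pd

def make_mask(s):
    # With raw=True, s is a plain ndarray rather than a Series
    if np.unique(s).size == 2:  # If binary, return all-False mask
        return np.zeros_like(s, dtype=bool)
    else:  # Otherwise, identify outliers
        return s >= np.percentile(s, 99)

# Toy data (made up): one binary column and one numeric column
df = pd.DataFrame({"flag": np.arange(100) % 2, "x": np.arange(100.0)})

# axis=0 is an assumption here: one mask per column
mask = df.apply(make_mask, axis=0, raw=True)

# Drop every row flagged as an outlier in at least one column
cleaned = df[~mask.any(axis=1)]
print(mask.shape, len(cleaned))
```

Because every branch returns a plain boolean array of the same length, there are no index labels to misalign, and the mask keeps the shape of df.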

Source

2015-04-28 15:31:53 TheBlackCat

This solved the problem with my original data as well. Thanks! –
