2016-09-17 123 views
2

Pandas: error in DataFrame.unstack

I wrote the following function to convert several columns of a data frame to numeric values:

def factorizeMany(data, columns): 
    """ Factorize a bunch of columns in a data frame""" 
    data[columns] = data[columns].stack().rank(method='dense').unstack() 
    return data 

Calling it like this

trainDataPre = factorizeMany(trainDataMerged.fillna(0), columns=["char_{0}".format(i) for i in range(1,10)]) 

gives me an error. I don't know where to look for the cause. Could it be bad input?

--------------------------------------------------------------------------- 
AttributeError       Traceback (most recent call last) 
<ipython-input-14-357f8a4b2ef8> in <module>() 
     1 #trainDataPre = trainDataMerged.drop(["people_id", "activity_id", "date"], axis=1) 
     2 #trainDataPre = trainDataMerged.fillna(0) 
----> 3 trainDataPre = mininggear.factorizeMany(trainDataMerged.fillna(0), columns=["char_{0}".format(i) for i in range(1,10)]) 

/Users/cls/Dropbox/Datengräber/Kaggle/RedHat/mininggear.py in factorizeMany(data, columns) 
    15 def factorizeMany(data, columns): 
    16  """ Factorize a bunch of columns in a data frame""" 
---> 17  data[columns] = data[columns].stack().rank(method='dense').unstack() 
    18  return data 
    19 

/usr/local/lib/python3.5/site-packages/pandas/core/series.py in unstack(self, level, fill_value) 
    2041   """ 
    2042   from pandas.core.reshape import unstack 
-> 2043   return unstack(self, level, fill_value) 
    2044 
    2045  # ---------------------------------------------------------------------- 

/usr/local/lib/python3.5/site-packages/pandas/core/reshape.py in unstack(obj, level, fill_value) 
    405  else: 
    406   unstacker = _Unstacker(obj.values, obj.index, level=level, 
--> 407        fill_value=fill_value) 
    408   return unstacker.get_result() 
    409 

/usr/local/lib/python3.5/site-packages/pandas/core/reshape.py in __init__(self, values, index, level, value_columns, fill_value) 
    90 
    91   # when index includes `nan`, need to lift levels/strides by 1 
---> 92   self.lift = 1 if -1 in self.index.labels[self.level] else 0 
    93 
    94   self.new_index_levels = list(index.levels) 

AttributeError: 'Index' object has no attribute 'labels' 
+0

Can you provide a sample of your `trainDataMerged` data frame?

+0

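@AlbertoGarcia-Raboso Post a huge CSV string? And what if the sample does not contain the data that causes the error? As the answer below shows, this question can be answered with some insight. – clstaudt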

Answers

1

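The error arises because you fill the NaNs in the data frame with 0 before calling the function, so the rank operation is performed on a subset of the data frame that mixes numeric and categorical/string values.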
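To make the mixing concrete, here is a minimal sketch that is not part of the original answer. The column and variable names are illustrative, and it assumes an object-dtype column, which is how older pandas versions store strings:

```python
import numpy as np
import pandas as pd

# Hypothetical column: strings with missing values, stored as object dtype
col = pd.Series(['giraffe', np.nan, 'cat', np.nan], dtype=object)
filled = col.fillna(0)

# After fillna(0) the column holds both str and int values
kinds = {type(v).__name__ for v in filled}
print(kinds)

# Ranking needs an ordering, and Python cannot order str against int
try:
    sorted(filled)
    not_orderable = False
except TypeError:
    not_orderable = True
print(not_orderable)
```

How `rank` reacts to such a column depends on the pandas version, so the sketch only shows that the values have no common ordering.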

Consider this case:

import numpy as np
import pandas as pd

df = pd.DataFrame({'char_1': ['cat', 'dog', 'buffalo', 'cat'],
                   'char_2': ['mouse', 'tiger', 'lion', 'mouse'],
                   'char_3': ['giraffe', np.nan, 'cat', np.nan]})
df


df = df.fillna(0) 
df[['char_3']].stack().rank() 
Series([], dtype: float64) 

So you are essentially performing the unstack operation on an empty Series, which is not what you want to do.
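The link to the traceback can be sketched as follows. This is an illustration rather than part of the original answer: an empty Series carries a flat `Index` instead of the `MultiIndex` that `stack()` produces, so `unstack` has no level to pivot on. The exception type is an assumption that varies by version: the pandas release in the traceback raises `AttributeError`, while more recent releases are expected to raise `ValueError`.

```python
import pandas as pd

# An empty Series like the one shown above; its index is flat
empty = pd.Series([], dtype="float64")
print(isinstance(empty.index, pd.MultiIndex))

# A stacked frame, by contrast, carries a (row, column) MultiIndex
stacked = pd.DataFrame({'char_1': ['cat', 'dog'],
                        'char_2': ['mouse', 'tiger']}).stack()
print(isinstance(stacked.index, pd.MultiIndex))

try:
    empty.unstack()
    failed = False
except (AttributeError, ValueError) as exc:
    # AttributeError on old pandas, ValueError on newer releases
    failed = True
    print(type(exc).__name__, exc)
```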

A better way to do this, which avoids further complications:

def factorizeMany(data, columns): 
    """ Factorize a bunch of columns in a data frame""" 
    stacked = data[columns].stack(dropna=False) 
    data[columns] = pandas.Series(stacked.factorize()[0], index=stacked.index).unstack() 
    return data 
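For comparison, here is a sketch of a variant that skips `stack`/`unstack` altogether and factorizes the flattened values with `pd.factorize`. This is a swapped-in technique, not the function above: the name `factorize_many` is hypothetical, and unlike the original it returns a copy instead of modifying `data` in place. It may be useful on newer pandas releases, where the `dropna` argument of `stack` is deprecated.

```python
import numpy as np
import pandas as pd

def factorize_many(data, columns):
    """Return a copy of data with the given columns replaced by shared integer codes."""
    values = data[columns].to_numpy(dtype=object)
    # Flatten row by row, factorize once so equal values share a code, then restore the shape
    codes, uniques = pd.factorize(values.ravel())
    out = data.copy()
    out[columns] = codes.reshape(values.shape)
    return out, uniques

df = pd.DataFrame({'char_1': ['cat', 'dog', 'buffalo', 'cat'],
                   'char_2': ['mouse', 'tiger', 'lion', 'mouse'],
                   'char_3': ['giraffe', np.nan, 'cat', np.nan]})

encoded, uniques = factorize_many(df, ['char_1', 'char_2', 'char_3'])
print(encoded)
print(list(uniques))
```

Codes are assigned in order of first appearance, so 'cat' receives the same code in `char_1` and `char_3`, and missing values are encoded as -1.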
+0

Sorry, merely changing the order of `factorizeMany` and `fillna` still produces the error. – clstaudt

+1

Also, the `stack` method drops all `NaN` values by default. So you may want to use `stack(dropna=False)` before passing the result to the `factorize` method, which marks them as -1.