2017-10-08 38 views
0

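LabelBinarizer behaves inconsistently because of NaN

I am trying to convert a DataFrame column containing text into a one-hot encoded matrix. This worked fine for a while, but it stopped working for reasons unknown to me. The message says: "TypeError: '>' not supported between instances of 'str' and 'float'". That seems like nonsense to me, because I am only using text data. When I repeat the experiment with a small data set, LabelBinarizer works fine and produces the desired output.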

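I noticed that the X_train DataFrame is 4.6 GB in size. My machine has only 8 GB. Is there some memory limit I should be aware of? All the values are small; should I convert them to int32 and float32?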
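On the memory question, here is a minimal sketch of how one might measure a DataFrame's footprint and downcast its numeric columns. The frame and its column names (`count`, `ratio`) are made up for illustration and are not taken from X_train. Note that running out of memory would normally surface as a MemoryError, not as a TypeError raised by a comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X_train: small values stored in 64-bit columns.
frame = pd.DataFrame({
    "count": np.arange(1000, dtype=np.int64) % 100,
    "ratio": np.linspace(0.0, 1.0, 1000),
})

# Total size in bytes, including the index.
before = frame.memory_usage(deep=True).sum()

# Downcast the numeric columns; this is only safe if every value fits
# the narrower type (float32 also loses precision).
small = frame.astype({"count": np.int32, "ratio": np.float32})
after = small.memory_usage(deep=True).sum()

print(before, after)
```

Downcasting roughly halves the space taken by 64-bit numeric columns, but it does nothing for text columns stored with dtype object.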

I was able to reproduce the error below, but I am not sure whether this provides enough information.

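import pandas as pd
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()

s = ['a', 'b', 'c', 'b', 'a']

df = pd.DataFrame(s)

df = pd.Series(s)  # small test series; replaces the DataFrame above

dd = X_train['state']  # the real column from the full data set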

type(dd) 
Out[9]: pandas.core.series.Series 

type(df) 
Out[10]: pandas.core.series.Series 

lb.fit(dd) 
Traceback (most recent call last): 

    File "<ipython-input-11-5ec245111e31>", line 1, in <module> 
    lb.fit(dd) 

    File "C:\packages\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 296, in fit 
    self.y_type_ = type_of_target(y) 

    File "C:\packages\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py", line 275, in type_of_target 
    if (len(np.unique(y)) > 2) or (y.ndim >= 2 and len(y[0]) > 1): 

    File "C:\packages\Anaconda3\lib\site-packages\numpy\lib\arraysetops.py", line 214, in unique 
    ar.sort() 

TypeError: '>' not supported between instances of 'str' and 'float' 


lb.fit(df) 
Out[12]: LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False) 

df.value_counts() 
Out[13]: 
a 2 
b 2 
c 1 
dtype: int64 

dd.value_counts() 
Out[14]: 
MI 228601 
CA  5020 
TX  2420 
FL  2237 
IL  1310 
SC  1304 
OH  967 
NY  673 
MN  632 
GA  535 
NV  484 
UT  477 
PA  466 
NJ  395 
VA  385 
NC  353 
MD  349 
AZ  329 
ME  261 
OK  248 
AL  215 
TN  207 
WA  192 
MA  182 
IA  159 
WI  159 
OR  153 
MO  151 
CO  147 
KY  146 
IN  106 
AR  82 
LA  81 
AK  79 
UK  77 
NB  77 
MS  64 
CT  60 
DC  58 
ON  51 
DE  50 
KS  37 
RI  35 
SD  33 
ID  33 
MT  28 
NM  21 
BC  17 
WY  12 
HI  10 
NH   9 
VT   7 
VI   6 
WV   6 
PR   5 
QC   5 
QL   3 
ND   2 
BL   2 
Name: state, dtype: int64 

len(df) 
Out[15]: 5 

len(dd) 
Out[16]: 250306 

Answer

1

Perhaps the input data contains missing values.

from sklearn.preprocessing import LabelBinarizer 
import numpy as np 
import pandas as pd 

lb = LabelBinarizer() 

s = ['a','b','c','b','a', np.nan] 
df = pd.DataFrame(s, columns=["state"]) 

df_binarized = lb.fit_transform(df['state']) 
df_binarized 

Traceback (most recent call last): 
    File "/home/kuroyanagi/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code 
    exec(code_obj, self.user_global_ns, self.user_ns) 
    File "<ipython-input-45-f16e01b4e1be>", line 4, in <module> 
    df_binarized = lb.fit_transform(df['state']) 
    File "/home/kuroyanagi/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/sklearn/base.py", line 494, in fit_transform 
    return self.fit(X, **fit_params).transform(X) 
    File "/home/kuroyanagi/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/sklearn/preprocessing/label.py", line 296, in fit 
    self.y_type_ = type_of_target(y) 
    File "/home/kuroyanagi/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/sklearn/utils/multiclass.py", line 275, in type_of_target 
    if (len(np.unique(y)) > 2) or (y.ndim >= 2 and len(y[0]) > 1): 
    File "/home/kuroyanagi/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/numpy/lib/arraysetops.py", line 210, in unique 
    return _unique1d(ar, return_index, return_inverse, return_counts) 
    File "/home/kuroyanagi/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/numpy/lib/arraysetops.py", line 277, in _unique1d 
    ar.sort()
TypeError: '<' not supported between instances of 'float' and 'str' 

Without missing values, it works as follows.

from sklearn.preprocessing import LabelBinarizer
import numpy as np
import pandas as pd

lb = LabelBinarizer()

s = ['a','b','c','b','a']
df = pd.DataFrame(s, columns=["state"])

df_binarized = lb.fit_transform(df['state'])
df_binarized

Out[46]: 
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [0, 1, 0],
       [1, 0, 0]])
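If missing values are indeed the cause, two common ways to handle them before fitting are sketched below. This is only an illustration under assumptions: the placeholder label "missing" is a made-up choice, and whether to fill or drop depends on the data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

df = pd.DataFrame(['a', 'b', 'c', 'b', 'a', np.nan], columns=["state"])

# Option 1: give missing values their own label ("missing" is a hypothetical
# placeholder), so every element is a string and sorting works.
filled = df["state"].fillna("missing")
lb = LabelBinarizer()
onehot = lb.fit_transform(filled)
print(lb.classes_)
print(onehot.shape)

# Option 2: drop the rows whose state is missing before fitting.
dropped = df["state"].dropna()
onehot_dropped = LabelBinarizer().fit_transform(dropped)
print(onehot_dropped.shape)
```

Option 1 keeps the row count aligned with the rest of the frame, at the cost of one extra column; option 2 changes the number of rows, so the other columns have to be filtered the same way.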
+0

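Yes! That was the culprit. Thank you very much. I scanned my column to test the type of each value and noticed that some NaNs were interpreted as float. Very confusing. When you ask for the dtype of a pandas Series, it may lie to you. I changed the title to reflect this finding. – Arnold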
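A per-value type scan of this kind might look like the sketch below. The small Series is a made-up stand-in for the real `state` column. A Series with dtype object can hold any Python object, so the dtype alone does not reveal a float NaN sitting among strings.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X_train['state'], with one missing value.
dd = pd.Series(['MI', 'CA', np.nan, 'TX'], name="state", dtype=object)

print(dd.dtype)  # the column-level dtype says nothing about single elements

# Count the Python type of each element by name.
type_counts = dd.map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# Count the missing entries directly.
n_missing = int(dd.isna().sum())
print(n_missing)

# Sorting mixed str and float values is what fails inside np.unique.
sort_failed = False
try:
    sorted(dd.tolist())
except TypeError as exc:
    sort_failed = True
    print(exc)
```

Note that `value_counts()` skips NaN by default, which is why the output in the question gave no hint of the missing entries; `dd.value_counts(dropna=False)` would list them.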

+1

Thank you for changing the title. I have run into the same problem in the past. It may be useful to others with the same problem. – Keiku

+0

I am glad you shared this solution. If only the error message had been clearer, it would not have cost me a day. – Arnold