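Error in fit_transform: Input contains NaN, infinity or a value too large for dtype('float64')
I have a DataFrame of shape (14407, 2564). I am trying to remove low-variance features using the VarianceThreshold function. However, when I call fit_transform, I get the following error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Before using VarianceThreshold, I replaced all missing values in my DataFrame with the code below: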
df.replace('null',np.NaN, inplace=True)
df.replace(r'^\s*$', np.NaN, regex=True, inplace=True)
df.fillna(value=df.median(), inplace=True)
After that, I checked my DataFrame for any null or infinite values using:
m = df.isnull().any()
print "========= COLUMNS WITH NULL VALUES ================="
print m[m]
print "========= COLUMNS WITH INFINITE VALUES ================="
m = np.isfinite(df.select_dtypes(include=['float64'])).any()
print m[m]
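This check ran on the DataFrame after the replacement step and before VarianceThreshold. I get an empty Series as output, which means none of my columns has any missing values. The output is: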
========= COLUMNS WITH NULL VALUES =================
Series([], dtype: bool)
========= COLUMNS WITH INFINITE VALUES =================
Series([], dtype: bool)
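One thing worth checking about the second test: `np.isfinite(...).any()` is True for a column that has at least one finite value, so `m[m]` lists columns containing finite values rather than columns containing infinite ones. The sketch below contrasts that with an inverted check. The small `df` here is a hypothetical stand-in, not the real data, and the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real DataFrame.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [1.0, np.inf, 3.0],
    "c": [1, 2, 3],  # int64 column, skipped by the float64 filter below
})

floats = df.select_dtypes(include=["float64"])

# As written in the question: True when a column has at least one finite value.
has_finite = np.isfinite(floats).any()

# Inverted check: True when a column has at least one NaN or +/-inf value.
has_non_finite = (~np.isfinite(floats)).any()

print(has_finite[has_finite].index.tolist())
print(has_non_finite[has_non_finite].index.tolist())
```

With the original form of the check, an empty result would suggest that no float64 column with a finite value was selected at all, so it may also be worth printing `df.dtypes` to see which dtypes the columns actually have.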
Full error traceback:
Traceback (most recent call last):
  File "/home/users/MyUsername/MyProject/src/main/python/Main.py", line 222, in <module>
    main()
  File "/home/users/MyUsername/MyProject/src/main/python/Main.py", line 218, in main
    getAllData()
  File "/home/users/MyUsername/MyProject/src/main/python/Main.py", line 95, in getAllData
    predictors, labels, dropped_features = fselector.process(variance=True, corr=True, bestf=True, bestfk=200)
  File "/home/users/MyUsername/MyProject/src/main/python/classes/featureselector.py", line 54, in process
    self.getVariance(threshold=(.95 * (1 - .95)))
  File "/home/users/MyUsername/MyProject/src/main/python/classes/featureselector.py", line 136, in getVariance
    self.removeLowVarianceColumns(df=self.X, thresh=threshold)
  File "/home/users/MyUsername/MyProject/src/main/python/classes/featureselector.py", line 213, in removeLowVarianceColumns
    selector.fit_transform(df)
  File "/usr/lib64/python2.7/site-packages/sklearn/base.py", line 494, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/usr/lib64/python2.7/site-packages/sklearn/feature_selection/variance_threshold.py", line 64, in fit
    X = check_array(X, ('csr', 'csc'), dtype=np.float64)
  File "/usr/lib64/python2.7/site-packages/sklearn/utils/validation.py", line 407, in check_array
    _assert_all_finite(array)
  File "/usr/lib64/python2.7/site-packages/sklearn/utils/validation.py", line 58, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
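So I don't know what else to check. This doesn't look like a missing-value problem, but I also haven't been able to find which columns or values are causing it.
I have seen several threads here about this error, and in all of them the cause turned out to be a missing value, but that doesn't seem to be the problem in my case.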
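You should always post the complete stack trace of the error.
@VivekKumar I added it to the post. – Sarah
First convert it to a NumPy array with `X = np.asanyarray(df)`. Then check whether each of the following two statements returns True or False: 1) `np.isfinite(X.sum())` 2) `np.isfinite(X).all()`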
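The two checks suggested in the comment above can be sketched as follows, extended to report which columns hold the offending values. This is a minimal sketch on a hypothetical toy frame, not the real data. It assumes every column can be cast to float64, which mirrors the `dtype=np.float64` argument visible in the traceback; `bad_mask` and `bad_columns` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real DataFrame.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0],
    "y": [0.5, np.nan, 1.5],
    "z": [1.0, -np.inf, 2.0],
})

# Cast to float64 (assumption: all columns are numeric or castable).
X = np.asanyarray(df, dtype=np.float64)

sum_is_finite = np.isfinite(X.sum())  # check 1 from the comment
all_finite = np.isfinite(X).all()     # check 2 from the comment

# Locate the offending cells and map them back to column names.
bad_mask = ~np.isfinite(X)
bad_columns = df.columns[bad_mask.any(axis=0)].tolist()
bad_rows, bad_cols = np.nonzero(bad_mask)

print(sum_is_finite, all_finite)
print(bad_columns)
print(list(zip(bad_rows.tolist(), bad_cols.tolist())))
```

Note that check 1 can be False even when check 2 is True, because the sum of very large finite values can overflow float64. If the cast itself raises an error, that would point to a non-numeric column rather than to NaN or infinity.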