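ValueError: Array contains NaN or infinity in _assert_all_finite during LinearSVC
I was trying to classify the wine data set here - http://archive.ics.uci.edu/ml/datasets/Wine+Quality - using logistic regression (with method='bfgs' and the L1 norm), and caught a singular matrix error, LinAlgError('Singular matrix'), even though the matrix is full rank [I tested this using np.linalg.matrix_rank(data[train_cols].values)].
This is how I came to the conclusion that some features might be linear combinations of others. To test this, I experimented with grid search / LinearSVC, and I am getting the error below. My code and data set follow.
I can see that only 6/7 features are actually "independent", which I infer from comparing x_train_new[0] with the rows of x_train (so I can tell which columns are redundant).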
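One possible reason a matrix can pass a rank test and still trigger a singular matrix error is that it is only nearly rank deficient. The sketch below uses made-up data (not the wine set) and compares np.linalg.matrix_rank with np.linalg.cond. The column construction and the 1e-9 noise level are assumptions chosen for illustration, and this may not be what happens with the wine features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix (NOT the wine data).
X = rng.normal(size=(200, 4))

# Extra column that is an exact linear combination of two existing columns.
combo = 2.0 * X[:, 0] - X[:, 1]
X_dep = np.column_stack([X, combo])

# Extra column that is only *almost* a linear combination (tiny noise added).
X_near = np.column_stack([X, combo + 1e-9 * rng.normal(size=200)])

rank_dep = np.linalg.matrix_rank(X_dep)    # exact dependence lowers the rank
rank_near = np.linalg.matrix_rank(X_near)  # near dependence does not
cond_base = np.linalg.cond(X)              # well conditioned
cond_near = np.linalg.cond(X_near)         # very large: numerically fragile

print(rank_dep, rank_near, cond_base, cond_near)
```

A very large condition number suggests that a solver inverting a related matrix could fail even though matrix_rank reports full rank.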
# Train & test DATA CREATION
from sklearn.svm import LinearSVC
import numpy, random
import pandas as pd
df = pd.read_csv("https://github.com/ekta1007/Predicting_wine_quality/blob/master/wine_red_dataset.csv")
#,skiprows=0, sep=',')
df=df.dropna(axis=1,how='any') # also tried how='all' - still get NaN errors as below
header=list(df.columns.values) # or df.columns
X = df[df.columns - [header[-1]]] # header[-1] == 'quality' - this is to make the code generic enough
Y = df[header[-1]] # df['quality']
rows = random.sample(df.index, int(len(df)*0.7)) # indexing the rows that will be picked in the train set
x_train, y_train = X.ix[rows],Y.ix[rows] # Fetching the data frame using indexes
x_test,y_test = X.drop(rows),Y.drop(rows)
# Training the classifier using linear support vector classification (LinearSVC).
clf = LinearSVC(C=0.01, penalty="l1", dual=False) #,tol=0.0001,fit_intercept=True, intercept_scaling=1)
clf.fit(x_train, y_train)
x_train_new = clf.fit_transform(x_train, y_train)
#print x_train_new #works
clf.predict(x_test) # does NOT work and gives NaN errors for some x_tests
clf.score(x_test, y_test) # Does NOT work
clf.coef_ # Works, but I am not sure, if this is OK, given huge NaN's - or does the coef's get impacted ?
clf.predict(x_train)
552 NaN
209 NaN
427 NaN
288 NaN
175 NaN
427 NaN
748 7
552 NaN
429 NaN
[... and MORE]
Name: quality, Length: 1119
clf.predict(x_test)
76 NaN
287 NaN
420 7
812 NaN
443 7
420 7
430 NaN
373 5
624 5
[..and More]
Name: quality, Length: 480
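Strangely, when I run clf.predict(x_train) I still see some NaNs. What am I doing wrong? Once the model has been trained, this should not happen, right?
Following this thread, "How to fix "NaN or infinity" issue for sparse matrix in python?", I also checked that there are no nulls in my CSV file (although I did relabel "quality" to only the labels 5 and 7, from the original range(3, 10)).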
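For comparison, here is a minimal sketch of the same train/predict/score flow on plain NumPy arrays, with an explicit finiteness check before fitting. It uses a made-up DataFrame rather than the wine CSV. The column names, the label rule and C=1.0 are illustrative assumptions, and train_test_split is swapped in for the manual index sampling.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the wine data; column names are illustrative only.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(300, 3)), columns=["alcohol", "pH", "sulphates"])
df["quality"] = np.where(df["alcohol"] + 0.1 * rng.normal(size=300) > 0, 7, 5)

# Drop rows (axis=0) with missing values; axis=1 would drop whole columns.
df = df.dropna(axis=0, how="any")

# Hand scikit-learn plain arrays instead of pandas objects.
X = df.drop(columns=["quality"]).to_numpy(dtype=float)
y = df["quality"].to_numpy()

# Fail early, with a clear message, if anything non-finite slipped through.
assert np.isfinite(X).all(), "features contain NaN or infinity"
assert np.isfinite(y.astype(float)).all(), "labels contain NaN or infinity"

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

clf = LinearSVC(C=1.0, penalty="l1", dual=False)  # C is an arbitrary choice
clf.fit(x_train, y_train)
pred = clf.predict(x_test)           # NumPy array of labels (5 or 7)
accuracy = clf.score(x_test, y_test)
print(pred[:10], accuracy)
```

If the assertions pass and predict still appears to return NaN, it may be worth checking whether the printed object is really the prediction array: predict returns a NumPy array, not a pandas Series with a Name.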
Also, here are the data types of x_test and y_test (and likewise for train):
x_test
<class 'pandas.core.frame.DataFrame'>
Int64Index: 480 entries, 1 to 1596
Data columns:
alcohol 480 non-null values
chlorides 480 non-null values
citric acid 480 non-null values
density 480 non-null values
fixed acidity 480 non-null values
free sulfur dioxide 480 non-null values
pH 480 non-null values
residual sugar 480 non-null values
sulphates 480 non-null values
total sulfur dioxide 480 non-null values
volatile acidity 480 non-null values
dtypes: float64(11)
y_test
1 5
10 5
18 5
21 5
30 5
31 7
36 7
40 5
50 5
52 7
53 5
55 5
57 5
60 5
61 5
[..And MORE]
Name: quality, Length: 480
And finally:
clf.score(x_test, y_test)
Traceback (most recent call last):
File "<pyshell#31>", line 1, in <module>
clf.score(x_test, y_test)
File "C:\Python27\lib\site-packages\sklearn\base.py", line 279, in score
return accuracy_score(y, self.predict(X))
File "C:\Python27\lib\site-packages\sklearn\metrics\metrics.py", line 742, in accuracy_score
y_true, y_pred = check_arrays(y_true, y_pred)
File "C:\Python27\Lib\site-packages\sklearn\utils\validation.py", line 215, in check_arrays
File "C:\Python27\Lib\site-packages\sklearn\utils\validation.py", line 18, in _assert_all_finite
ValueError: Array contains NaN or infinity.
# I also explicitly checked for NaNs as follows:
for i in df.columns:
df[i].isnull()
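The loop above evaluates df[i].isnull() but neither prints nor aggregates the result, so it would not report anything when run as a script. A minimal sketch of a check that counts NaNs and infinities per column is shown below, on a tiny made-up frame (not the wine data). Note that isnull() does not flag infinities by default, so they are counted separately.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with one NaN and one infinity (NOT the wine data).
df = pd.DataFrame({
    "alcohol": [9.4, np.nan, 9.8],
    "pH": [3.51, 3.20, np.inf],
    "quality": [5, 7, 5],
})

# NaN count per column.
nan_counts = df.isnull().sum()

# Infinity count per numeric column.
numeric = df.select_dtypes(include=[np.number])
values = numeric.to_numpy(dtype=float)
inf_counts = pd.Series(np.isinf(values).sum(axis=0), index=numeric.columns)

# Index labels of rows holding any NaN or infinity.
bad_rows = df.index[~np.isfinite(values).all(axis=1)].tolist()

print(nan_counts, inf_counts, bad_rows, sep="\n")
```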
Hint: please also say whether my reasoning for using LinearSVC is correct for my use case, or whether I should use grid search.
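For what it is worth, grid search and LinearSVC are not alternatives: grid search tunes the hyperparameters of an estimator such as LinearSVC. A minimal sketch, assuming synthetic data from make_classification and an arbitrary grid of C values (neither comes from the question), could look like this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic data with some redundant features (NOT the wine data).
X, y = make_classification(
    n_samples=300, n_features=11, n_informative=5, n_redundant=3, random_state=0
)

# Candidate values for C; this grid is an arbitrary illustrative choice.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LinearSVC(penalty="l1", dual=False, max_iter=5000), param_grid, cv=5
)
search.fit(X, y)

best_c = search.best_params_["C"]
# The L1 penalty can drive some coefficients to exactly zero.
n_kept = int(np.sum(search.best_estimator_.coef_ != 0))

print(best_c, search.best_score_, n_kept)
```

Features with zero coefficients under the L1 penalty are ones the model did not need, which is not the same as a proof that they are linear combinations of the others.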
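Disclaimer: parts of this code were borrowed from suggestions made in similar contexts on StackOverflow and other sources. My actual use case is only to assess whether this method is a good fit for my scenario. That's all.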
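@AndreasMueller - It looks like you only read the post backwards and did not see the heart of the problem, which is to "test" whether I do indeed have linear combinations of features. Anyway, thanks for mentioning that pandas is not compatible with scikit-learn; I think that may lead somewhere. PPS: I understand every piece of code I put here very well, except for the data frame mismatch, which I was not aware of. – ekta
Was my post deleted? Well, I did answer the questions you asked, even if they were unrelated to the "heart" of your post. I am sorry if I offended you, but a disclaimer that you copied code from stackoverflow gives the impression that you did not. The code is still invalid Python, btw. –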