2014-01-27 28 views
0

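ValueError: Array contains NaN or infinity in _assert_all_finite during LinearSVC

I am trying to classify the wine dataset here - http://archive.ics.uci.edu/ml/datasets/Wine+Quality - using logistic regression (with method='bfgs' and the l1 norm), and I hit a singular matrix error, LinAlgError('Singular matrix'), even though the matrix is full rank [I tested this with np.linalg.matrix_rank(data[train_cols].values)].

That is how I came to suspect that some features might be linear combinations of others. To test this, I experimented with grid search / LinearSVC, and I get the error below. My code and dataset follow.

I can see that only 6/7 features are actually "independent", which I read off by comparing the rows of x_train_new[0] and x_train (so I can tell which columns are redundant).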
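For reference, here is a minimal sketch of what np.linalg.matrix_rank reports when one column is an exact linear combination of others. The data is synthetic and the names base, redundant and features are made up for illustration; this is not the wine dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature matrix: 100 rows, 3 independent columns.
base = rng.normal(size=(100, 3))

# Append a fourth column that is an exact linear combination of the first two.
redundant = 2.0 * base[:, 0] - 0.5 * base[:, 1]
features = np.column_stack([base, redundant])

n_cols = features.shape[1]
rank = np.linalg.matrix_rank(features)

# rank < n_cols means at least one column depends linearly on the others.
print("columns:", n_cols, "rank:", rank)

# A very large condition number hints at near-collinearity even at full rank.
print("condition number:", np.linalg.cond(features))
```

A matrix can be full rank and still be badly conditioned, so the condition number is worth a look when a solver complains about singularity despite a full-rank result.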

# Train & test DATA CREATION
from sklearn.svm import LinearSVC
import numpy, random
import pandas as pd
df = pd.read_csv("https://github.com/ekta1007/Predicting_wine_quality/blob/master/wine_red_dataset.csv")
#,skiprows=0, sep=',')


df=df.dropna(axis=1,how='any') # also tried how='all' - still get NaN errors as below
header=list(df.columns.values) # or df.columns
X = df[df.columns - [header[-1]]] # header[-1] = ['quality'] - this is to make the code generic enough
Y = df[header[-1]] # df['quality']
rows = random.sample(df.index, int(len(df)*0.7)) # indexing the rows that will be picked in the train set
x_train, y_train = X.ix[rows],Y.ix[rows] # Fetching the data frame using indexes
x_test,y_test = X.drop(rows),Y.drop(rows)
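The .ix indexer used above is not available in current pandas releases. Below is a rough sketch of the same 70/30 split using .loc and DataFrame.sample. The small frame is a made-up stand-in for the wine data, and train_rows and label are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the wine data: two features plus a 'quality' label.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "alcohol": rng.normal(10.0, 1.0, size=20),
    "pH": rng.normal(3.3, 0.1, size=20),
    "quality": rng.choice([5, 7], size=20),
})

label = df.columns[-1]            # last column holds the target
X = df.drop(columns=[label])      # every column except the target
Y = df[label]

# Draw 70% of the rows for training; the remainder forms the test set.
train_rows = df.sample(frac=0.7, random_state=0).index
x_train, y_train = X.loc[train_rows], Y.loc[train_rows]   # .loc instead of .ix
x_test, y_test = X.drop(train_rows), Y.drop(train_rows)

print(len(x_train), len(x_test))
```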


# Training the classifier using Linear Support Vector Classification (LinearSVC).
clf = LinearSVC(C=0.01, penalty="l1", dual=False) #,tol=0.0001,fit_intercept=True, intercept_scaling=1) 
clf.fit(x_train, y_train) 
x_train_new = clf.fit_transform(x_train, y_train) 
#print x_train_new #works 
clf.predict(x_test) # does NOT work and gives NaN errors for some x_tests 


clf.score(x_test, y_test) # Does NOT work 
clf.coef_ # Works, but I am not sure, if this is OK, given huge NaN's - or does the coef's get impacted ? 

clf.predict(x_train) 
552 NaN 
209 NaN 
427 NaN 
288 NaN 
175 NaN 
427 NaN 
748  7 
552 NaN 
429 NaN 
[... and MORE] 
Name: quality, Length: 1119 

clf.predict(x_test) 
76 NaN 
287 NaN 
420  7 
812 NaN 
443  7 
420  7 
430 NaN 
373  5 
624  5 
[..and More] 
Name: quality, Length: 480 

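Strangely, I still see some NaNs when I run clf.predict(x_train) - what am I doing wrong? The model was trained on exactly this data, so this should not happen, right?

Following this thread, I also checked that there are no nulls in my csv file (though I did relabel "quality" to just the labels 5 and 7, down from range(3,10)): How to fix "NaN or infinity" issue for sparse matrix in python?

Also, here are the data types of x_test & y_test/train...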

x_test 
<class 'pandas.core.frame.DataFrame'> 
Int64Index: 480 entries, 1 to 1596 
Data columns: 
alcohol     480 non-null values 
chlorides    480 non-null values 
citric acid    480 non-null values 
density     480 non-null values 
fixed acidity   480 non-null values 
free sulfur dioxide  480 non-null values 
pH      480 non-null values 
residual sugar   480 non-null values 
sulphates    480 non-null values 
total sulfur dioxide 480 non-null values 
volatile acidity  480 non-null values 
dtypes: float64(11) 

y_test 
1  5 
10 5 
18 5 
21 5 
30 5 
31 7 
36 7 
40 5 
50 5 
52 7 
53 5 
55 5 
57 5 
60 5 
61 5 
[..And MORE] 
Name: quality, Length: 480 

And finally..

clf.score(x_test, y_test) 

Traceback (most recent call last): 
    File "<pyshell#31>", line 1, in <module> 
    clf.score(x_test, y_test) 
    File "C:\Python27\lib\site-packages\sklearn\base.py", line 279, in score 
    return accuracy_score(y, self.predict(X)) 
    File "C:\Python27\lib\site-packages\sklearn\metrics\metrics.py", line 742, in accuracy_score 
    y_true, y_pred = check_arrays(y_true, y_pred) 
    File "C:\Python27\Lib\site-packages\sklearn\utils\validation.py", line 215, in check_arrays 
    File "C:\Python27\Lib\site-packages\sklearn\utils\validation.py", line 18, in _assert_all_finite 
ValueError: Array contains NaN or infinity. 


#I also explicitly checked for NaN's as here -: 
for i in df.columns: 
    df[i].isnull() 
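The loop above produces one boolean Series per column, which is hard to scan by eye. A more compact check might look like the sketch below. The tiny frame is invented for illustration, with one NaN and one infinity planted in it. Note that isnull() does not flag infinities by default, while the scikit-learn error covers both.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value and one infinity planted in it.
df = pd.DataFrame({
    "alcohol": [9.4, np.nan, 10.5],
    "pH": [3.51, 3.20, np.inf],
    "quality": [5, 5, 7],
})

# Per-column count of missing values (NaN/None).
nan_counts = df.isnull().sum()
print(nan_counts)

# isnull() does not catch infinities, so test finiteness on the raw array too.
values = df.values.astype(float)
all_finite = np.isfinite(values).all()
print("all finite:", all_finite)

# Row label and column name of every non-finite entry.
bad_rows, bad_cols = np.where(~np.isfinite(values))
for r, c in zip(bad_rows, bad_cols):
    print("row", df.index[r], "column", df.columns[c], "value", values[r, c])
```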

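Hint: Please also tell me whether my reasoning for using LinearSVC is sound given my use case, or whether I should use grid search instead.

Disclaimer: Parts of this code are built on suggestions from StackOverflow and other sources in similar contexts. My actual use case is just to assess whether this method is a good fit for my scenario. That's all.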


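@AndreasMueller - It looks like you read the post backwards and missed the heart of the question, which is to "test" whether some of my features really are linear combinations of others. Anyway, thanks for mentioning that pandas is not compatible with scikit-learn - I think that may lead somewhere. PPS: I understand every piece of code I put here very well, except for the DataFrame mismatch, which I was not aware of. – ekta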


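Was my post deleted? Well, I did answer a question about your problem, even if it was unrelated to the "heart" of your post. I am sorry if I offended you, but your disclaimer about copying code from stackoverflow gave the impression that you did not understand it. The code is still invalid Python btw. –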

Answer

2

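This worked. The only thing I really had to change was to use x_test.values, and likewise .values for the remaining pandas DataFrames (x_train, y_train, y_test). As pointed out, the only cause was the incompatibility between pandas DataFrames and scikit-learn (which works with numpy arrays).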

#changing your Pandas Dataframe elegantly to work with scikit-learn by transformation to numpy arrays 
>>> type(x_test) 
<class 'pandas.core.frame.DataFrame'> 
>>> type(x_test.values) 
<type 'numpy.ndarray'> 

This tip comes from this article, http://python.dzone.com/articles/python-making-scikit-learn-and, and from @AndreasMueller, who pointed out the inconsistency.
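Put together, the fix described above could look roughly like this. It is a sketch on synthetic data, not the original wine file. The column names and the rule used to generate the labels are assumptions made for illustration, and the LinearSVC settings are copied from the question. The NaN symptom was reported on the old library versions visible in the traceback, and newer releases may behave differently when given DataFrames directly.

```python
import numpy as np
import pandas as pd
from sklearn.svm import LinearSVC

# Hypothetical stand-in for the wine data, with labels restricted to 5 and 7.
rng = np.random.default_rng(0)
n = 200
alcohol = rng.normal(10.0, 1.0, size=n)
ph = rng.normal(3.3, 0.1, size=n)
quality = np.where(alcohol > 10.0, 7, 5)
df = pd.DataFrame({"alcohol": alcohol, "pH": ph, "quality": quality})

# Random 70/30 split, which leaves non-contiguous row labels in both parts.
train = df.sample(frac=0.7, random_state=0)
test = df.drop(train.index)

feature_cols = ["alcohol", "pH"]

# Hand scikit-learn plain NumPy arrays rather than pandas objects.
x_train, y_train = train[feature_cols].values, train["quality"].values
x_test, y_test = test[feature_cols].values, test["quality"].values

clf = LinearSVC(C=0.01, penalty="l1", dual=False)  # settings from the question
clf.fit(x_train, y_train)

pred = clf.predict(x_test)
print(type(pred), "any NaN:", np.isnan(pred.astype(float)).any())
print("accuracy:", clf.score(x_test, y_test))
```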
