
I am using sklearn.linear_model.Perceptron on a synthetic dataset I created. The data consists of 2 classes, each drawn from a multivariate Gaussian with a common non-diagonal covariance matrix. The class centroids are close enough together that there is significant overlap. Why does sklearn's Perceptron predict with accuracy, precision, etc. all equal to 1?

import numpy as np
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

# Two 20-dimensional class means, one unit apart in every coordinate
mean1 = np.ones((20,))
mean2 = 2 * np.ones((20,))

# Shared, non-diagonal covariance matrix
A = 0.1 * np.random.randn(20, 20)
cov = np.dot(A, A.T)

# 2000 samples per class
class1 = np.random.multivariate_normal(mean1, cov, 2000)
class2 = np.random.multivariate_normal(mean2, cov, 2000)

# Append the class label (1 or 2) as a 21st column
class1 = np.concatenate((class1, np.ones((len(class1), 1))), axis=1)
class2 = np.concatenate((class2, 2 * np.ones((len(class2), 1))), axis=1)

# Split each class separately, then merge and shuffle
class1_train, class1_test = train_test_split(class1, test_size=0.3)
class2_train, class2_test = train_test_split(class2, test_size=0.3)
train = np.concatenate((class1_train, class2_train), axis=0)
test = np.concatenate((class1_test, class2_test), axis=0)

np.random.shuffle(train)
np.random.shuffle(test)
y_train = train[:, 20]
x_train = train[:, 0:20]
y_test = test[:, 20]
x_test = test[:, 0:20]

After saving this data, I simply used:

import sklearn.linear_model
import sklearn.metrics

classifier = sklearn.linear_model.Perceptron()
classifier.fit(x_train, y_train)
predicted_test = classifier.predict(x_test)

accuracy = sklearn.metrics.accuracy_score(y_test, predicted_test)
precision = sklearn.metrics.precision_score(y_test, predicted_test)
recall = sklearn.metrics.recall_score(y_test, predicted_test)
f_measure = sklearn.metrics.f1_score(y_test, predicted_test)
print(accuracy, precision, recall, f_measure)

The data overlaps by design. Yet the linear classifier is somehow able to predict perfectly, with accuracy, precision, and the other metrics all equal to 1.

Please turn this into a [MCVE]. There are a lot of undefined variables and functions. – cel

Thank you. I will rewrite the question following the instructions in the link. –

Answer


The correct way to use cross_validation.train_test_split is to give it the full dataset and let it partition the data into x_train, x_test, y_train, y_test.

The following code works better:

import numpy as np
from sklearn import cross_validation  # sklearn.model_selection in newer versions

# mean1, mean2 and cov as defined in the question
class1 = np.random.multivariate_normal(mean1, cov, 2000)
class2 = np.random.multivariate_normal(mean2, cov, 2000)

# Append the class label as the 21st column
class1 = np.concatenate((class1, np.ones((len(class1), 1))), axis=1)
class2 = np.concatenate((class2, 2 * np.ones((len(class2), 1))), axis=1)

# Build one combined dataset and shuffle it once
dataset = np.concatenate((class1, class2), axis=0)
np.random.shuffle(dataset)

# Let train_test_split do the partitioning on the full dataset
x_train, x_test, y_train, y_test = \
    cross_validation.train_test_split(dataset[:, :20], dataset[:, 20], test_size=0.3)
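For a quick sense of how separable this data really is, one sanity check (a sketch, reusing the dataset, mean1 and mean2 arrays above) is to project every sample onto the direction between the two class means and compare the two projected ranges. With these parameters the two projected clouds typically do not overlap at all, which is why a linear classifier can reach perfect scores:

# Sketch: project samples onto the unit vector pointing from mean1 to mean2
w = (mean2 - mean1) / np.linalg.norm(mean2 - mean1)
proj = dataset[:, :20].dot(w)
labels = dataset[:, 20]
print(proj[labels == 1].min(), proj[labels == 1].max())  # range of class-1 projections
print(proj[labels == 2].min(), proj[labels == 2].max())  # range of class-2 projections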

Note that the perceptron can actually achieve 100% accuracy on your data. Try adding some noise to get a feel for it.

For example:

noise = np.random.normal(0, 1, (4000, 20))

# Add unit-variance Gaussian noise to every feature
dataset[:, 0:20] = dataset[:, 0:20] + noise

x_train, x_test, y_train, y_test = \
    cross_validation.train_test_split(dataset[:, :20], dataset[:, 20], test_size=0.3)
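A minimal follow-up sketch (reusing the Perceptron and accuracy_score calls from the question) shows the effect of the added noise; the reported accuracy should now typically drop below 1:

classifier = sklearn.linear_model.Perceptron()
classifier.fit(x_train, y_train)            # train on the noisy split
predicted_test = classifier.predict(x_test)
print(sklearn.metrics.accuracy_score(y_test, predicted_test))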