1
我想在Python中運行kNN(k-最近鄰居)算法。訓練測試拆分似乎不能在Python中正常工作?
我使用的嘗試做到這一點可在UCI機器學習庫的數據集:https://archive.ics.uci.edu/ml/datasets/wine
這裏是我使用的代碼:
#1. LIBRARIES
import os
import pandas as pd
import numpy as np
print os.getcwd() # Prints the working directory
os.chdir('C:\\file_path') # Provide the path here
#2. VARIABLES
variables = pd.read_csv('wines.csv')
winery = variables['winery']
alcohol = variables['alcohol']
malic = variables['malic']
ash = variables['ash']
ash_alcalinity = variables['ash_alcalinity']
magnesium = variables['magnesium']
phenols = variables['phenols']
flavanoids = variables['flavanoids']
nonflavanoids = variables['nonflavanoids']
proanthocyanins = variables['proanthocyanins']
color_intensity = variables['color_intensity']
hue = variables['hue']
od280 = variables['od280']
proline = variables['proline']
#3. MAX-MIN NORMALIZATION
alcoholscaled=(alcohol-min(alcohol))/(max(alcohol)-min(alcohol))
malicscaled=(malic-min(malic))/(max(malic)-min(malic))
ashscaled=(ash-min(ash))/(max(ash)-min(ash))
ash_alcalinity_scaled=(ash_alcalinity-min(ash_alcalinity))/(max(ash_alcalinity)-min(ash_alcalinity))
magnesiumscaled=(magnesium-min(magnesium))/(max(magnesium)-min(magnesium))
phenolsscaled=(phenols-min(phenols))/(max(phenols)-min(phenols))
flavanoidsscaled=(flavanoids-min(flavanoids))/(max(flavanoids)-min(flavanoids))
nonflavanoidsscaled=(nonflavanoids-min(nonflavanoids))/(max(nonflavanoids)-min(nonflavanoids))
proanthocyaninsscaled=(proanthocyanins-min(proanthocyanins))/(max(proanthocyanins)-min(proanthocyanins))
color_intensity_scaled=(color_intensity-min(color_intensity))/(max(color_intensity)-min(color_intensity))
huescaled=(hue-min(hue))/(max(hue)-min(hue))
od280scaled=(od280-min(od280))/(max(od280)-min(od280))
prolinescaled=(proline-min(proline))/(max(proline)-min(proline))
alcoholscaled.mean()
alcoholscaled.median()
alcoholscaled.min()
alcoholscaled.max()
#4. DATA FRAME
d = {'alcoholscaled' : pd.Series([alcoholscaled]),
'malicscaled' : pd.Series([malicscaled]),
'ashscaled' : pd.Series([ashscaled]),
'ash_alcalinity_scaled' : pd.Series([ash_alcalinity_scaled]),
'magnesiumscaled' : pd.Series([magnesiumscaled]),
'phenolsscaled' : pd.Series([phenolsscaled]),
'flavanoidsscaled' : pd.Series([flavanoidsscaled]),
'nonflavanoidsscaled' : pd.Series([nonflavanoidsscaled]),
'proanthocyaninsscaled' : pd.Series([proanthocyaninsscaled]),
'color_intensity_scaled' : pd.Series([color_intensity_scaled]),
'hue_scaled' : pd.Series([huescaled]),
'od280scaled' : pd.Series([od280scaled]),
'prolinescaled' : pd.Series([prolinescaled])}
df = pd.DataFrame(d)
#5. TRAIN-TEST SPLIT
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(np.matrix(df),np.matrix(winery),test_size=0.3)
print X_train.shape, y_train.shape
print X_test.shape, y_test.shape
#6. K-NEAREST NEIGHBOUR ALGORITHM
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))
在第5節,當我運行sklearn.model_selection導入列車測試拆分機制,但由於它提供了形狀:(0,13) (0,178) (1,13) (1,178)
,因此它看起來沒有正確運行。
然後,在試圖運行knn時,我收到錯誤消息:Found array with 0 sample(s) (shape=(0,13)) while a minimum of 1 is required.
這不是由於使用最大 - 最小歸一化進行縮放,因爲即使變量未縮放,我仍然收到此錯誤消息。
也許'test_train_split()'想要一個'numpy.ndarray'不是'numpy.matrixlib.defmatrix.matrix' – cardamom
非常感謝。當我嘗試導入wine數據集時,出現錯誤「無法導入名稱load_wine」。我試圖用load_iris來做到這一點,它沒有任何問題,甚至運行knn.fit。我在想這個特定數據集有錯誤,或者當我嘗試手動導入時,它是不同的文件類型。 – empoleon
@cardamom:當我最初生成數據框時,它的格式爲pandas.core.frame.DataFrame。不知道這個問題是否源於此。 – empoleon