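Problem vectorizing specific columns with scikit-learn's DictVectorizer?
I want to understand how to do a simple prediction task. I am playing with this dataset, which is also available here in a different format. It is about student performance in certain courses, and I would like to vectorize only some of its columns rather than use all the data (just to understand how it works). So I tried the following with DictVectorizer: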
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
training_data = pd.read_csv('/Users/user/Downloads/student/student-mat.csv')
dict_vect = DictVectorizer(sparse=False)
training_matrix = dict_vect.fit_transform(training_data['G1','G2','sex','school','age'])
training_matrix.toarray()
Then I would like to pass in another set of feature rows, like this:
testing_data = pd.read_csv('/Users/user/Downloads/student/student-mat_test.csv')
test_matrix = dict_vect.transform(testing_data['G1','G2','sex','school','age'])
The problem is that I get the following traceback:
/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/bin/python2.7 school_2.py
Traceback (most recent call last):
File "/Users/user/PycharmProjects/PAN-pruebas/escuela_2.py", line 14, in <module>
X = dict_vect.fit_transform(df['sex','age','address','G1','G2'].values)
File "school_2.py", line 1787, in __getitem__
return self._getitem_column(key)
File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 1794, in _getitem_column
return self._get_item_cache(key)
File "/usr/local/lib/python2.7/site-packages/pandas/core/generic.py", line 1079, in _get_item_cache
values = self._data.get(item)
File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 2843, in get
loc = self.items.get_loc(item)
File "/usr/local/lib/python2.7/site-packages/pandas/core/index.py", line 1437, in get_loc
return self._engine.get_loc(_values_from_object(key))
File "pandas/index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas/index.c:3824)
File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3704)
File "pandas/hashtable.pyx", line 697, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12349)
File "pandas/hashtable.pyx", line 705, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12300)
KeyError: ('sex', 'age', 'address', 'G1', 'G2')
Process finished with exit code 1
Any idea how to correctly vectorize both datasets (i.e. training and testing), and how to show both matrices with .toarray()?
Update
>>>print training_data.info()
/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/user/PycharmProjects/PAN-pruebas/escuela_3.py
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 396 entries, (school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, guardian, traveltime, studytime, failures, schoolsup, famsup, paid, activities, nursery, higher, internet, romantic, famrel, freetime, goout, Dalc, Walc, health, absences) to (MS, M, 19, U, LE3, T, 1, 1, other, at_home, course, father, 1, 1, 0, no, no, no, no, yes, yes, yes, no, 3, 2, 3, 3, 3, 5, 5)
Data columns (total 3 columns):
id 396 non-null object
content 396 non-null object
label 396 non-null object
dtypes: object(3)
memory usage: 22.7+ KB
None
Process finished with exit code 0
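Hmm, your training data has only 3 columns because some of the columns were loaded as the index. Also, G1 and G2 are not even in the index. I will try loading this myself. – EdChum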
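To illustrate the loading issue mentioned in the comment above: one possible cause, and this is only an assumption since the file itself is not shown here, is that the CSV uses a delimiter other than a comma (a semicolon, for example). The sketch below uses a small hypothetical inline excerpt instead of the real student-mat.csv and passes `sep` explicitly to `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical excerpt standing in for student-mat.csv.
# Assumption: the real file is semicolon-delimited.
sample = io.StringIO(
    "school;sex;age;G1;G2\n"
    "GP;F;18;5;6\n"
    "MS;M;19;8;9\n"
)

# Passing the delimiter explicitly keeps every field in its own column
# instead of letting pandas guess the layout.
df = pd.read_csv(sample, sep=";")

print(df.columns.tolist())
print(df.dtypes)
```

If the column list printed by a check like this does not contain G1 and G2, the delimiter (or the header row) is worth inspecting before any vectorizing is attempted.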
I can load the data correctly, but you seem to misunderstand how to use DictVectorizer: it expects a dict and not an array: http://scikit-learn.org/0.11/modules/generated/sklearn.feature_extraction.DictVectorizer.html. – EdChum
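A minimal sketch of what that could look like, assuming the data has been loaded so that the five columns exist as ordinary columns. The two small DataFrames are made-up stand-ins for the training and test files. Note the double brackets for selecting several columns (a single pair of brackets makes pandas look for one column named by the whole tuple, which matches the KeyError in the traceback) and `to_dict(orient='records')` to turn each row into a dict.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Made-up stand-ins for the training and test CSV files.
train = pd.DataFrame({
    "G1": [5, 8, 12], "G2": [6, 9, 13],
    "sex": ["F", "M", "F"], "school": ["GP", "MS", "GP"],
    "age": [18, 19, 17],
})
test = pd.DataFrame({
    "G1": [10], "G2": [11], "sex": ["M"], "school": ["MS"], "age": [16],
})

cols = ["G1", "G2", "sex", "school", "age"]

# sparse=False returns a dense NumPy array, so no .toarray() call is needed.
dict_vect = DictVectorizer(sparse=False)

# A list of columns (double brackets) selects several columns at once;
# to_dict(orient="records") yields one dict per row.
training_matrix = dict_vect.fit_transform(train[cols].to_dict(orient="records"))

# Only transform the test rows, so they share the training feature layout.
test_matrix = dict_vect.transform(test[cols].to_dict(orient="records"))

print(dict_vect.feature_names_)
print(training_matrix)
print(test_matrix)
```

String values are one-hot encoded (for example `sex=F`, `sex=M`) while numeric values pass through unchanged. With the default `sparse=True` the result would be a SciPy sparse matrix instead, and `.toarray()` would then apply.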
I see. Is there any other way to vectorize a .csv file (a "dataset") in order to present it to an estimator? – skwoi
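One possible alternative, sketched here as a suggestion rather than as anything proposed in the thread, is to one-hot encode the categorical columns directly in pandas with `pd.get_dummies` and then align the test columns to the training columns with `reindex`. The DataFrames are again made-up stand-ins for the real files.

```python
import pandas as pd

# Made-up stand-ins for the training and test CSV files.
train = pd.DataFrame({
    "G1": [5, 8, 12], "G2": [6, 9, 13],
    "sex": ["F", "M", "F"], "school": ["GP", "MS", "GP"],
    "age": [18, 19, 17],
})
test = pd.DataFrame({
    "G1": [10], "G2": [11], "sex": ["M"], "school": ["MS"], "age": [16],
})

cols = ["G1", "G2", "sex", "school", "age"]
categorical = ["sex", "school"]

# One-hot encode only the categorical columns; numeric ones are kept as-is.
X_train = pd.get_dummies(train[cols], columns=categorical)

# The test set may lack some categories, so align it to the training columns
# and fill the missing indicator columns with 0.
X_test = pd.get_dummies(test[cols], columns=categorical).reindex(
    columns=X_train.columns, fill_value=0
)

print(X_train.columns.tolist())
print(X_test)
```

The `reindex` step matters because `get_dummies` only creates columns for the categories it actually sees, so without it the two matrices could end up with different widths.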