2015-04-30 34 views
1

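Problems vectorizing specific columns with scikit-learn's DictVectorizer?

I want to understand how to do a simple prediction task. I am playing with this dataset, which is also available here in a different format. It is about students' performance in some courses, and I want to vectorize only some columns of the dataset rather than use all the data (just to understand how it works). So I tried the following with DictVectorizer: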

import pandas as pd 
from sklearn.feature_extraction import DictVectorizer 

training_data = pd.read_csv('/Users/user/Downloads/student/student-mat.csv') 

dict_vect = DictVectorizer(sparse=False) 

training_matrix = dict_vect.fit_transform(training_data['G1','G2','sex','school','age']) 
training_matrix.toarray() 

Then I want to pass another set of feature rows like this:

testing_data = pd.read_csv('/Users/user/Downloads/student/student-mat_test.csv') 
test_matrix = dict_vect.transform(testing_data['G1','G2','sex','school','age']) 

The problem is that I get the following traceback:

/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/bin/python2.7 school_2.py 
Traceback (most recent call last): 
    File "/Users/user/PycharmProjects/PAN-pruebas/escuela_2.py", line 14, in <module> 
    X = dict_vect.fit_transform(df['sex','age','address','G1','G2'].values) 
    File "school_2.py", line 1787, in __getitem__ 
    return self._getitem_column(key) 
    File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 1794, in _getitem_column 
    return self._get_item_cache(key) 
    File "/usr/local/lib/python2.7/site-packages/pandas/core/generic.py", line 1079, in _get_item_cache 
    values = self._data.get(item) 
    File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 2843, in get 
    loc = self.items.get_loc(item) 
    File "/usr/local/lib/python2.7/site-packages/pandas/core/index.py", line 1437, in get_loc 
    return self._engine.get_loc(_values_from_object(key)) 
    File "pandas/index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas/index.c:3824) 
    File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3704) 
    File "pandas/hashtable.pyx", line 697, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12349) 
    File "pandas/hashtable.pyx", line 705, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12300) 
KeyError: ('sex', 'age', 'address', 'G1', 'G2') 

Process finished with exit code 1 

Any idea how to correctly vectorize both datasets (i.e. training and testing), and how to show the two matrices with .toarray()?

Update

>>>print training_data.info() 
/usr/local/Cellar/python/2.7.8_1/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/user/PycharmProjects/PAN-pruebas/escuela_3.py 
<class 'pandas.core.frame.DataFrame'> 
MultiIndex: 396 entries, (school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, guardian, traveltime, studytime, failures, schoolsup, famsup, paid, activities, nursery, higher, internet, romantic, famrel, freetime, goout, Dalc, Walc, health, absences) to (MS, M, 19, U, LE3, T, 1, 1, other, at_home, course, father, 1, 1, 0, no, no, no, no, yes, yes, yes, no, 3, 2, 3, 3, 3, 5, 5) 
Data columns (total 3 columns): 
id   396 non-null object 
content 396 non-null object 
label  396 non-null object 
dtypes: object(3) 
memory usage: 22.7+ KB 
None 

Process finished with exit code 0 
+0

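Hmm, your training data has only 3 columns because some of the columns are being loaded as the index, and G1 and G2 are not even in the index. I will try loading this myself – EdChum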
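A minimal sketch of checking the load, assuming a comma-separated file whose first line holds the column names. The three rows below are made-up stand-ins for student-mat.csv, not real data. Reading without a `names=` argument lets pandas take the column names from the header line, so G1 and G2 show up as ordinary columns:

```python
import io

import pandas as pd

# Hypothetical excerpt standing in for student-mat.csv (assumed comma-separated).
csv_text = """school,sex,age,G1,G2
GP,F,18,5,6
GP,M,17,7,8
MS,F,19,10,11
"""

# No `names=` argument: the first line of the file supplies the column names.
training_data = pd.read_csv(io.StringIO(csv_text))

# Confirm the expected columns exist before selecting them.
print(training_data.columns.tolist())
selected = training_data[['G1', 'G2', 'sex', 'school', 'age']]
print(selected.shape)
```

If the real file uses another delimiter, the `sep` parameter of `pd.read_csv` would need to match it.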

+1

I can load the data correctly, but you seem to misunderstand how to use DictVectorizer: it expects dicts, not an array: http://scikit-learn.org/0.11/modules/generated/sklearn.feature_extraction.DictVectorizer.html. – EdChum
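To illustrate the input DictVectorizer expects, here is a small sketch with two made-up records (the values are hypothetical, not taken from the dataset). Each sample is one dict; string values are one-hot encoded as `name=value` features and numeric values are passed through:

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical samples: one dict per row, mapping feature name to value.
records = [
    {'sex': 'F', 'school': 'GP', 'age': 18, 'G1': 5},
    {'sex': 'M', 'school': 'MS', 'age': 17, 'G1': 7},
]

dict_vect = DictVectorizer(sparse=False)
matrix = dict_vect.fit_transform(records)

# String features expand to one column per observed value.
print(dict_vect.feature_names_)
print(matrix.shape)
```

With `sparse=False` the result is already a NumPy array, so calling `.toarray()` on it is not needed.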

+0

I see.. Is there any other way to vectorize a .csv file (a "database") in order to present it to an estimator? – skwoi

Answers
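One possible alternative, which is not suggested anywhere in this thread, is pandas' own `get_dummies`. The sketch below uses a made-up two-row frame and assumes `sex` and `school` are the only categorical columns of interest:

```python
import pandas as pd

# Hypothetical rows standing in for a few columns of the CSV file.
df = pd.DataFrame({
    'G1': [5, 7],
    'sex': ['F', 'M'],
    'school': ['GP', 'MS'],
    'age': [18, 17],
})

# One-hot encode the categorical columns; numeric columns are left as they are.
encoded = pd.get_dummies(df, columns=['sex', 'school'])

# Convert to a plain float array that an estimator can consume.
X = encoded.to_numpy(dtype=float)
print(encoded.columns.tolist())
print(X.shape)
```

A caveat with this approach: training and test frames encoded separately can end up with different columns, so they would have to be aligned before being passed to an estimator.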

1

You need to pass a list:

test_matrix = dict_vect.transform(testing_data[['G1','G2','sex','school','age']]) 

What you did was try to index your df with all of these names as a single key:

['G1','G2','sex','school','age'] 

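That is why you get a KeyError: there is no single column named like the above. To select multiple columns you need to pass a list of column names using double brackets: [[col_list]]

Example: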

In [43]: 

df = pd.DataFrame(columns=['a','b']) 
df 
Out[43]: 
Empty DataFrame 
Columns: [a, b] 
Index: [] 
In [44]: 

df['a','b'] 
--------------------------------------------------------------------------- 
KeyError         Traceback (most recent call last) 
<ipython-input-44-33332c7e7227> in <module>() 
----> 1 df['a','b'] 

......  
pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12349)() 

pandas\hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12300)() 

KeyError: ('a', 'b') 

But this works:

In [45]: 

df[['a','b']] 
Out[45]: 
Empty DataFrame 
Columns: [a, b] 
Index: [] 
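Putting the column-list selection together with the dict input that DictVectorizer expects, a hedged end-to-end sketch might look like the following. The two frames are made-up stand-ins for the training and test CSV files, and converting rows with `to_dict(orient='records')` is one possible way to obtain the dicts, not necessarily what the asker ended up using:

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

cols = ['G1', 'G2', 'sex', 'school', 'age']

# Hypothetical rows standing in for the training and test CSV files.
training_data = pd.DataFrame({
    'G1': [5, 7, 10], 'G2': [6, 8, 11],
    'sex': ['F', 'M', 'F'], 'school': ['GP', 'GP', 'MS'],
    'age': [18, 17, 19], 'address': ['U', 'U', 'R'],
})
testing_data = pd.DataFrame({
    'G1': [9], 'G2': [9], 'sex': ['M'], 'school': ['MS'],
    'age': [18], 'address': ['R'],
})

dict_vect = DictVectorizer(sparse=False)

# Select columns with a list (double brackets), then turn each row into a dict.
training_matrix = dict_vect.fit_transform(training_data[cols].to_dict(orient='records'))

# Reuse the fitted vectorizer so the test matrix has the same columns.
test_matrix = dict_vect.transform(testing_data[cols].to_dict(orient='records'))

print(dict_vect.feature_names_)
print(training_matrix.shape, test_matrix.shape)
```

Calling `fit_transform` only on the training data and `transform` on the test data keeps both matrices in the same feature space.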
+0

I tried the following: 'training_data = pd.read_csv('/Users/user/Downloads/student/student-mat.csv', names=['id','content','label']) # testing_data = pd.read_csv('/Users/user/Desktop/student-mat_test.csv') dict_vect = DictVectorizer(sparse=False) training_matrix = dict_vect.fit_transform(training_data['G1','G2','sex','school','age']) print training_matrix.toarray()' and I still get the same error. Any idea how to proceed? – skwoi

+1

I just downloaded the data and will try to reproduce your error. Could you edit your question to add the output of 'training_data.info()'? – EdChum