2013-10-07 52 views

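Handling too many categorical features with scikit-learn

I'm new to scikit-learn and I'm trying to use the package to make predictions on income data. This may be a duplicate question, since I saw another post on the topic, but I'm looking for a simple example to understand what scikit-learn estimators expect.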

My data has the following structure, where many of the features are categorical (e.g. workclass, education, ...):

age: continuous. 
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. 
fnlwgt: continuous. 
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. 
education-num: continuous. 
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. 
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. 
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. 
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. 
sex: Female, Male. 
capital-gain: continuous. 
capital-loss: continuous. 
hours-per-week: continuous. 
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands. 

Example records:

38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K 
53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband  Black Male 0 0 40 United-States <=50K 
30 State-gov 141297 Bachelors 13 Married-civ-spouse Prof-specialty Husband  Asian-Pac-Islander Male 0 0 40 India >50K 

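I'm having a hard time handling the categorical features, since most models in scikit-learn expect all features to be numeric. The library provides some classes for transforming/encoding such features (such as OneHotEncoder and DictVectorizer), but I can't find a way to use them on my data. I know there are quite a few steps involved before everything is fully encoded as numbers, but I'm wondering whether anyone knows a simpler and more efficient way (since there are so many such features) that can be understood through an example. I vaguely know that DictVectorizer is the way to go, but I need help with how to proceed here.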

Answer


Here is some example code that uses DictVectorizer. First, let's set up some data in the Python shell; reading it from a file is left to you.

>>> features = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", 
...    "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country"] 
>>> input_text = """38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K 
... 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband  Black Male 0 0 40 United-States <=50K 
... 30 State-gov 141297 Bachelors 13 Married-civ-spouse Prof-specialty Husband  Asian-Pac-Islander Male 0 0 40 India >50K 
... """ 

Now, parse it:

>>> samples, y = [], []
>>> for ln in input_text.splitlines():
...     values = ln.split()
...     y.append(values[-1])
...     d = dict(zip(features, values[:-1]))
...     samples.append(d)
...

What have we got now? Let's take a look:

>>> from pprint import pprint 
>>> pprint(samples[0]) 
{'age': '38', 
'capital-gain': '0', 
'capital-loss': '0', 
'education': 'HS-grad', 
'education-num': '9', 
'fnlwgt': '215646', 
'hours-per-week': '40', 
'marital-status': 'Divorced', 
'native-country': 'United-States', 
'occupation': 'Handlers-cleaners', 
'race': 'White', 
'relationship': 'Not-in-family', 
'sex': 'Male', 
'workclass': 'Private'} 
>>> print(y) 
['<=50K', '<=50K', '>50K'] 
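Note that every value in the parsed dictionaries is still a string, including the fields the question lists as continuous (age, fnlwgt, and so on). DictVectorizer one-hot encodes string values and keeps numeric values as single numeric columns, so if the continuous fields should stay numeric they can be converted while parsing. The following is only a sketch of that optional variation; `CONTINUOUS` and `parse_record` are names made up for this illustration and are not part of scikit-learn. The rest of this answer keeps using the string-valued samples.

```python
from sklearn.feature_extraction import DictVectorizer

features = ["age", "workclass", "fnlwgt", "education", "education-num",
            "marital-status", "occupation", "relationship", "race", "sex",
            "capital-gain", "capital-loss", "hours-per-week", "native-country"]

# Fields that the question's schema marks as "continuous" (hypothetical helper set).
CONTINUOUS = {"age", "fnlwgt", "education-num",
              "capital-gain", "capital-loss", "hours-per-week"}


def parse_record(line):
    """Split one whitespace-separated record into (feature dict, label)."""
    values = line.split()
    sample = {}
    for name, value in zip(features, values[:-1]):
        # Numeric values are kept as one column; strings get one-hot encoded.
        sample[name] = float(value) if name in CONTINUOUS else value
    return sample, values[-1]


line = ("38 Private 215646 HS-grad 9 Divorced Handlers-cleaners "
        "Not-in-family White Male 0 0 40 United-States <=50K")
sample, label = parse_record(line)

dv = DictVectorizer()
X = dv.fit_transform([sample])
# Continuous fields show up under their own name, categorical ones as "name=value".
print(dv.feature_names_)
```

Whether to convert is a modelling decision: left as strings, each distinct age becomes its own indicator column, which is rarely what is wanted on the full data set.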

These samples are ready for DictVectorizer, so pass them in:

>>> from sklearn.feature_extraction import DictVectorizer 
>>> dv = DictVectorizer() 
>>> X = dv.fit_transform(samples) 
>>> X 
<3x29 sparse matrix of type '<type 'numpy.float64'>' 
     with 42 stored elements in Compressed Sparse Row format> 
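To see where the columns come from, the fitted vectorizer can be inspected. In the session above, the three records contain 29 distinct feature=value pairs and each record sets 14 of them, hence the 3x29 shape and 42 stored elements. The sketch below shows the same mechanism on a smaller, made-up pair of samples (`toy_samples` is a name invented here), using the `feature_names_` and `vocabulary_` attributes of the fitted DictVectorizer.

```python
from sklearn.feature_extraction import DictVectorizer

# Two toy samples in the same string-valued form as `samples` above.
toy_samples = [
    {"age": "38", "workclass": "Private", "sex": "Male"},
    {"age": "53", "workclass": "State-gov", "sex": "Male"},
]

dv = DictVectorizer()
X = dv.fit_transform(toy_samples)

# One column per distinct feature=value pair seen during fit, sorted by name.
print(dv.feature_names_)
# ['age=38', 'age=53', 'sex=Male', 'workclass=Private', 'workclass=State-gov']

# Mapping from column name to column index.
print(dv.vocabulary_)

# Dense view, for inspection only: each row has a 1 in the columns it sets.
print(X.toarray())
```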

Et voilà, you have X and y, which can be passed to an estimator as long as it supports sparse matrices. (Otherwise, pass sparse=False to the DictVectorizer constructor.)
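As a rough illustration of both options, the sketch below fits an estimator on the sparse output and then builds a dense array with `sparse=False`. The choice of `LogisticRegression` is an assumption made for this example (it accepts sparse input); the answer does not prescribe a particular estimator, and the samples are a cut-down subset of the fields from the three records above.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# A subset of the fields from the three example records, for brevity.
samples = [
    {"workclass": "Private", "education": "HS-grad", "native-country": "United-States"},
    {"workclass": "Private", "education": "11th", "native-country": "United-States"},
    {"workclass": "State-gov", "education": "Bachelors", "native-country": "India"},
]
y = ["<=50K", "<=50K", ">50K"]

# Sparse output (the default) works with estimators that accept scipy.sparse input.
dv = DictVectorizer()
X = dv.fit_transform(samples)
clf = LogisticRegression().fit(X, y)
print(clf.predict(X))

# Dense output, for estimators that need a plain NumPy array.
dv_dense = DictVectorizer(sparse=False)
X_dense = dv_dense.fit_transform(samples)
print(type(X_dense), X_dense.shape)
```

With many categorical features and a large number of rows, the sparse form is usually the more memory-friendly one, since most entries of a one-hot encoded matrix are zero.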

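Test samples can similarly be passed to DictVectorizer.transform; if the test set contains feature/value combinations that do not occur in the training set, these are simply ignored (the learned model could not do anything sensible with them anyway).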
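A small sketch of that behaviour, using made-up `train` and `test` lists: the value "Peru" never appears during fitting, so it has no column and contributes nothing to the transformed row.

```python
from sklearn.feature_extraction import DictVectorizer

train = [
    {"workclass": "Private", "native-country": "United-States"},
    {"workclass": "State-gov", "native-country": "India"},
]
# "Peru" never appears in the training samples.
test = [{"workclass": "Private", "native-country": "Peru"}]

# sparse=False only to make the printed result easy to read.
dv = DictVectorizer(sparse=False)
dv.fit(train)
X_test = dv.transform(test)

print(dv.feature_names_)
# ['native-country=India', 'native-country=United-States',
#  'workclass=Private', 'workclass=State-gov']
print(X_test)
# [[0. 0. 1. 0.]]
```

The important point is to call `fit` (or `fit_transform`) on the training samples only and plain `transform` on the test samples, so that both end up with the same columns.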