Feature selection in scikit-learn with multiple variables and thousands of features

I am trying to perform feature selection for a logistic regression classifier. Originally there are 4 variables: name, location, gender, and the label = ethnicity. One of the variables, name, generates thousands of "features": for example, the name "John Snow" yields 2-letter substrings such as 'jo', 'oh', 'hn', and so on. The feature set then goes through DictVectorizer.
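For reference, a minimal sketch of the 2-letter substring extraction described above; the name_bigrams helper is my own illustration (only the list_to_dict converter appears in the code below, and I reuse its 'substring=' key format):

from sklearn.feature_extraction import DictVectorizer

def name_bigrams(name):  # hypothetical helper, not from the original post
    s = name.lower().replace(' ', '')
    # every adjacent character pair becomes one boolean feature
    return {'substring=' + s[i:i + 2]: True for i in range(len(s) - 1)}

dicts = [name_bigrams(n) for n in ['John Snow', 'Jane Doe']]
dv = DictVectorizer()
X_bigrams = dv.fit_transform(dicts)  # sparse matrix, one column per distinct bigram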

I tried to follow this tutorial (http://scikit-learn.org/stable/auto_examples/feature_selection/plot_feature_selection.html), but I am not sure I am doing it correctly, because the tutorial uses a small number of features while I have tens of thousands of them after vectorization. Also, plt.show() displays a blank figure.

# coding=utf-8 
import pandas as pd 
from pandas import DataFrame, Series 
import numpy as np 
import re 
import random 
import time 
from random import randint 
import csv 
import sys 
reload(sys) 
sys.setdefaultencoding('utf-8') 

from sklearn import svm 
from sklearn.metrics import classification_report 
from sklearn.linear_model import LogisticRegression 
from sklearn.svm import LinearSVC 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.naive_bayes import MultinomialNB 
from sklearn.feature_extraction import DictVectorizer 
from sklearn.feature_selection import SelectPercentile, f_classif 
from sklearn.metrics import confusion_matrix as sk_confusion_matrix 
from sklearn.metrics import roc_curve, auc 
import matplotlib.pyplot as plt 
from sklearn.metrics import precision_recall_curve 

# Assign X and y variables 
X = df.raw_name.values 
X2 = df.name.values 
X3 = df.gender.values 
X4 = df.location.values 
y = df.ethnicity_scan.values 

# Feature extraction functions 
def feature_full_name(nameString):
    try:
        full_name = nameString
        if len(full_name) > 1:  # reject names with only one character
            return full_name
        else:
            return '?'
    except:
        return '?'

def feature_avg_wordLength(nameString):
    try:
        space = 0
        for i in nameString:
            if i == ' ':
                space += 1
        length = float(len(nameString) - space)
        name_entity = float(space + 1)
        avg = round(float(length / name_entity), 0)
        return avg
    except:
        return 0

def feature_name_entity(nameString2):
    space = 0
    try:
        for i in nameString2:
            if i == ' ':
                space += 1
        return space + 1
    except:
        return 0

def feature_gender(genString):
    try:
        gender = genString
        if len(gender) >= 1:
            return gender
        else:
            return '?'
    except:
        return '?'

def feature_noNeighborLoc(locString):
    try:
        x = re.sub(r'^[^, ]*', '', locString)  # strip the leading token (everything before the first ',' or ' ')
        y = x[2:]  # drop the ', ' separator itself
        return y
    except:
        return '?'

def list_to_dict(substring_list):
    try:
        substring_dict = {}
        for i in substring_list:
            substring_dict['substring=' + str(i)] = True
        return substring_dict
    except:
        return '?'

# Build one feature dict per row; DictVectorizer then turns the list of
# dicts into a sparse feature matrix
my_dict13 = [{'name-entity': feature_name_entity(feature_full_name(i))} for i in X2]
my_dict14 = [{'avg-length': feature_avg_wordLength(feature_full_name(i))} for i in X]
my_dict15 = [{'gender': feature_gender(i)} for i in X3]
my_dict16 = [{'location': feature_noNeighborLoc(feature_full_name(i))} for i in X4]

my_dict17 = [{'dummy1': 1} for i in X]  # constant for every row; this triggers the constant-feature warning below
my_dict18 = [{'dummy2': random.randint(0, 2)} for i in X]

all_dict = []
for i in range(0, len(my_dict13)):
    temp_dict = dict(my_dict13[i].items() + my_dict14[i].items()
                     + my_dict15[i].items() + my_dict16[i].items()
                     + my_dict17[i].items() + my_dict18[i].items())
    all_dict.append(temp_dict)

dv = DictVectorizer()  # the vectorizer must exist before fit_transform is called
newX = dv.fit_transform(all_dict)

# Separate the training and testing data sets 
half_cut = int(len(df)/2.0)*-1 
X_train = newX[:half_cut] 
X_test = newX[half_cut:] 
y_train = y[:half_cut] 
y_test = y[half_cut:] 

# Fit the model using the training data
lr = LogisticRegression()
lr.fit(X_train, y_train)

# Feature selection 
plt.figure(1) 
plt.clf() 
X_indices = np.arange(X_train.shape[-1]) 
selector = SelectPercentile(f_classif, percentile=10) 
selector.fit(X_train, y_train) 
scores = -np.log10(selector.pvalues_) 
scores /= scores.max() 
plt.bar(X_indices - .45, scores, width=.2, 
    label=r'Univariate score ($-Log(p_{value})$)', color='g') 
plt.show() 
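With tens of thousands of columns, one bar per feature is unreadable, and if any scores are NaN (see the constant-feature warning below) the normalized plot can come out entirely blank. A sketch that instead plots only the 50 highest-scoring features; the top-50 cutoff is an arbitrary choice of mine:

# Plot only the strongest features so the bars are actually visible
scores = -np.log10(selector.pvalues_)
scores = np.nan_to_num(scores)   # NaN scores from constant features become 0
top = np.argsort(scores)[-50:]   # indices of the 50 best features
plt.figure(2)
plt.bar(np.arange(len(top)), scores[top], color='g')
plt.xticks(np.arange(len(top)), top, rotation=90, fontsize=6)
plt.xlabel('feature index')
plt.ylabel(r'$-\log_{10}(p_{value})$')
plt.show()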

Warning:

E:\Program Files Extra\Python27\lib\site-packages\sklearn\feature_selection\univariate_selection.py:111: UserWarning: Features [[0 0 0 ..., 0 0 0]] are constant. 

There is no error traceback, only the warning (above), and it is able to generate a plot (but an empty one). – KubiK888
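The warning itself comes from the 'dummy1' column, which has the same value in every row: f_classif gives a constant feature a NaN p-value, and once a NaN enters the score normalization the whole bar plot can render blank. A hedged sketch of dropping constant columns with sklearn's VarianceThreshold before scoring (VarianceThreshold is standard sklearn; wiring it in here is my suggestion, not part of the original post):

from sklearn.feature_selection import VarianceThreshold

# the default threshold=0.0 removes zero-variance columns such as 'dummy1'
vt = VarianceThreshold()
X_train_nc = vt.fit_transform(X_train)
X_test_nc = vt.transform(X_test)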

Answer


It looks like the way you are splitting the data into training and testing sets won't work:

# Separate the training and testing data sets 
X_train = newX[:half_cut] 
X_test = newX[half_cut:] 

Since you are already using sklearn, it is much more convenient to use the built-in splitting routine:

from sklearn import cross_validation

X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    newX, y, test_size=0.5, random_state=0)  # split the vectorized newX, not the raw name strings
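Note that train_test_split shuffles the rows before splitting (seeded by random_state), whereas the half_cut slicing keeps the original row order, which can leave whole classes out of one half if the rows are sorted. With the split fixed, a sketch of the remaining order of operations under the question's variable names (my assembly, not code from the answer): vectorize before splitting, then fit the selector and the classifier on the training half only:

dv = DictVectorizer()
newX = dv.fit_transform(all_dict)

# ... train_test_split as above ...

selector = SelectPercentile(f_classif, percentile=10)
X_train_sel = selector.fit_transform(X_train, y_train)  # score features on the training half only
X_test_sel = selector.transform(X_test)                 # apply the same feature mask to the test half

lr = LogisticRegression()
lr.fit(X_train_sel, y_train)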