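I am using a logistic regression classifier to predict ethnicity class labels (0, 1). My data is split into training and test samples, and the feature dictionaries are vectorized into a sparse matrix. How can I predict a single new sample in Python scikit-learn when using dict vectorization?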

Below is the working code, in which I predict and validate on X_train and X_test, the vectorized feature matrices:

for i in mass[k]: 
    df = df_temp # reset df before each loop 
    count += 1 
    ethnicity_tar = str(i) 
    ############################################ 
    ############################################ 

    def ethnicity_target(row):
        try:
            if row[ethnicity_var] == ethnicity_tar:
                return 1
            else:
                return 0
        except:
            return None
    df['ethnicity_scan'] = df.apply(ethnicity_target, axis=1) 
    print '1=', ethnicity_tar 
    print '0=', 'non-'+ethnicity_tar 

    # Random sampling a smaller dataframe for debugging 
    rows = df.sample(n=subsample_size, random_state=seed) # Seed gives fixed randomness 
    df = DataFrame(rows) 
    print 'Class count:' 
    print df['ethnicity_scan'].value_counts() 

    # Assign X and y variables 
    X = df.raw_name.values 
    X2 = df.name.values 
    X3 = df.gender.values 
    X4 = df.location.values 
    y = df.ethnicity_scan.values 

    # Feature extraction functions 
    def feature_full_name(nameString):
        try:
            full_name = nameString
            if len(full_name) > 1:  # reject names with only 1 character
                return full_name
            else:
                return '?'
        except:
            return '?'

    def feature_full_last_name(nameString):
        try:
            last_name = nameString.rsplit(None, 1)[-1]
            if len(last_name) > 1:  # reject names with only 1 character
                return last_name
            else:
                return '?'
        except:
            return '?'

    def feature_full_first_name(nameString):
        try:
            first_name = nameString.rsplit(' ', 1)[0]
            if len(first_name) > 1:  # reject names with only 1 character
                return first_name
            else:
                return '?'
        except:
            return '?'

    # Transform format of X variables, and spit out a numpy array for all features 
    # Build one feature dict per name (avoids the Python 2-only items() + items() merge)
    all_dict = [{'last-name': feature_full_last_name(i),
                 'first-name': feature_full_first_name(i)} for i in X]

    newX = dv.fit_transform(all_dict) 

    # Separate the training and testing data sets 
    X_train, X_test, y_train, y_test = cross_validation.train_test_split(newX, y, test_size=testTrainSplit) 

    # Fitting X and y into model, using training data 
    classifierUsed2.fit(X_train, y_train) 

    # Making predictions using trained data 
    y_train_predictions = classifierUsed2.predict(X_train) 
    y_test_predictions = classifierUsed2.predict(X_test) 

However, I would like to predict the ethnicity label for just a single name, for example "John Carter". I replaced the line y_train_predictions = classifierUsed2.predict(X_train) with the line below, but it results in an error:

print classifierUsed2.predict(["John Carter"]) 

#error 
Error: X has 1 features per sample; expecting 103916 

Try classifierUsed2.predict(dv.transform("John Carter")) – Stergios

Thanks, but it says: "Error: 'str' object has no attribute 'iteritems'" – KubiK888

Answer

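You need to transform the new data in exactly the same way as the training data. Because dv was fit on a list of dictionaries, passing a raw string (or a list of strings) to dv.transform raises the 'iteritems' error from the comments. Build the same feature dictionary for the new name first, then pass a one-element list containing it, so something like

name = "John Carter"
classifierUsed2.predict(dv.transform([{'last-name': feature_full_last_name(name), 'first-name': feature_full_first_name(name)}]))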
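For anyone who wants to try this end to end, here is a minimal self-contained sketch on toy data. The names, the labels, and the helper name_features are made up for illustration and are not from the question; the helper assumes a non-empty name and only mirrors the question's first-name/last-name features. It is written in Python 3 syntax, unlike the Python 2 code in the question. The point it shows is that a new sample goes through the same feature function and the already fitted DictVectorizer (transform, not fit_transform), so it ends up with the same number of columns the model was trained on.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


def name_features(full_name):
    """Hypothetical helper: build the feature dict for one non-empty name."""
    last_name = full_name.rsplit(None, 1)[-1]
    first_name = full_name.rsplit(" ", 1)[0]
    return {
        "last-name": last_name if len(last_name) > 1 else "?",
        "first-name": first_name if len(first_name) > 1 else "?",
    }


# Toy training data, invented purely for this example
train_names = ["John Carter", "Mary Carter", "Wei Chen",
               "Li Chen", "Anna Smith", "Jun Wang"]
labels = [1, 1, 0, 0, 1, 0]

# Fit the vectorizer once, on the training dictionaries
dv = DictVectorizer()
X_train = dv.fit_transform([name_features(n) for n in train_names])

clf = LogisticRegression()
clf.fit(X_train, labels)

# Single new sample: same feature function, then transform() with the
# fitted vectorizer, wrapped in a list because transform expects
# an iterable of dicts
X_new = dv.transform([name_features("John Carter")])
prediction = clf.predict(X_new)

print(X_new.shape)   # one row, same column count as X_train
print(prediction)
```

Feature values that were never seen during fitting are silently ignored by DictVectorizer.transform, so a completely unknown name produces an all-zero row rather than an error.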