
I've heard it said that you can adjust the threshold to tune the trade-off between precision and recall, but I can't find an actual example of how to do this. How do I change the precision and recall threshold in Python scikit-learn?

My code:

for i in mass[k]:
    df = df_temp  # reset df before each loop
    #$$
    #$$
    if 1 == 1:
    ###if i == singleEthnic:
        count += 1
        ethnicity_tar = str(i)  # fr, en, ir, sc, others, ab, rus, ch, it, jp
        # fn, metis, inuit; algonquian, iroquoian, athapaskan, wakashan, siouan, salish, tsimshian, kootenay
        ############################################
        ############################################

        def ethnicity_target(row):
            try:
                if row[ethnicity_var] == ethnicity_tar:
                    return 1
                else:
                    return 0
            except:
                return None
        df['ethnicity_scan'] = df.apply(ethnicity_target, axis=1)
        print '1=', ethnicity_tar
        print '0=', 'non-' + ethnicity_tar

        # Random sampling a smaller dataframe for debugging
        rows = df.sample(n=subsample_size, random_state=seed)  # Seed gives fixed randomness
        df = DataFrame(rows)
        print 'Class count:'
        print df['ethnicity_scan'].value_counts()

        # Assign X and y variables
        X = df.raw_name.values
        X2 = df.name.values
        X3 = df.gender.values
        X4 = df.location.values
        y = df.ethnicity_scan.values

        # Feature extraction functions
        def feature_full_name(nameString):
            try:
                full_name = nameString
                if len(full_name) > 1:  # do not accept names with only 1 character
                    return full_name
                else:
                    return '?'
            except:
                return '?'

        def feature_full_last_name(nameString):
            try:
                last_name = nameString.rsplit(None, 1)[-1]
                if len(last_name) > 1:  # do not accept names with only 1 character
                    return last_name
                else:
                    return '?'
            except:
                return '?'

        def feature_full_first_name(nameString):
            try:
                first_name = nameString.rsplit(' ', 1)[0]
                if len(first_name) > 1:  # do not accept names with only 1 character
                    return first_name
                else:
                    return '?'
            except:
                return '?'

        # Transform format of X variables, and spit out a numpy array for all features
        my_dict = [{'last-name': feature_full_last_name(i)} for i in X]
        my_dict5 = [{'first-name': feature_full_first_name(i)} for i in X]

        all_dict = []
        for i in range(0, len(my_dict)):
            temp_dict = dict(
                my_dict[i].items() + my_dict5[i].items()
            )
            all_dict.append(temp_dict)

        newX = dv.fit_transform(all_dict)

        # Separate the training and testing data sets
        X_train, X_test, y_train, y_test = cross_validation.train_test_split(newX, y, test_size=testTrainSplit)

        # Fitting X and y into model, using training data
        classifierUsed2.fit(X_train, y_train)

        # Making predictions using trained data
        y_train_predictions = classifierUsed2.predict(X_train)
        y_test_predictions = classifierUsed2.predict(X_test)

I tried replacing the line `y_test_predictions = classifierUsed2.predict(X_test)` with `y_test_predictions = classifierUsed2.predict(X_test) > 0.8` and with `y_test_predictions = classifierUsed2.predict(X_test) > 0.01`, but there was no dramatic change.


Thanks DoughnutZombie, can you tell me how to highlight text in gray? – KubiK888


To mark inline code, use backticks at the beginning and end. See also http://stackoverflow.com/editing-help, e.g. "Comment formatting" at the very bottom. –


To your question: which classifier are you using? Does the classifier have not just `predict`, but also `predict_proba`? Because `predict` usually outputs only 1s and 0s, while `predict_proba` outputs a float that you can threshold. –

Answer


`classifierUsed2.predict(X_test)` outputs only the predicted class for each sample (most likely 0s and 1s). What you want is `classifierUsed2.predict_proba(X_test)`, which outputs a 2D array containing the probability of each class for each sample. To apply a threshold you can do something like:

y_test_probabilities = classifierUsed2.predict_proba(X_test) 
# y_test_probabilities has shape = [n_samples, n_classes] 

y_test_predictions_high_precision = y_test_probabilities[:,1] > 0.8 
y_test_predictions_high_recall = y_test_probabilities[:,1] > 0.1 

`y_test_predictions_high_precision` will contain only the samples the classifier is fairly sure belong to class 1, while `y_test_predictions_high_recall` will predict class 1 more often (and achieve a higher recall), but will also contain many false positives.
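The trade-off described above can be checked end to end on synthetic data. The sketch below uses made-up data from `make_classification` and a plain `LogisticRegression` as a stand-in for the asker's `classifierUsed2` (and the modern `model_selection.train_test_split` instead of the older `cross_validation` module); none of it is the asker's actual pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the asker's name dataset
X, y = make_classification(n_samples=1000, n_classes=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # P(class == 1) for each test sample

pred_precise = proba > 0.8  # strict threshold: fewer positives predicted
pred_recall = proba > 0.1   # loose threshold: more positives predicted

print('precision @0.8:', precision_score(y_test, pred_precise))
print('recall    @0.8:', recall_score(y_test, pred_precise))
print('precision @0.1:', precision_score(y_test, pred_recall))
print('recall    @0.1:', recall_score(y_test, pred_recall))
```

Raising the threshold can only shrink the set of predicted positives, so recall at 0.8 can never exceed recall at 0.1; precision typically moves the other way.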

`predict_proba` is supported by both of the classifiers you use, logistic regression and SVM (note that for scikit-learn's `SVC` it is only available when the model is constructed with `probability=True`).
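For the SVM case, a minimal sketch on made-up data (again, not the asker's): `SVC` exposes `predict_proba` only when `probability=True` is passed at construction time, which makes the fit run an extra internal probability-calibration step:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# probability=True is required for predict_proba on SVC
clf = SVC(probability=True, random_state=0).fit(X, y)
proba = clf.predict_proba(X)  # shape = [n_samples, n_classes]

# Same kind of thresholding as with logistic regression:
high_precision = proba[:, 1] > 0.8
print(proba.shape)
```

Without `probability=True`, calling `predict_proba` on a fitted `SVC` raises an error; `decision_function` is the non-probabilistic alternative you could also threshold.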
