2017-07-17

Python machine learning: trained classifier raises "index is out of bounds"

I have a trained classifier that has been working fine.

I tried to modify it to handle multiple .csv files using a loop, but this has broken it: the original code (which was working fine) now returns the same error for a .csv file it previously processed without any problems.

I am very confused and can't see what would suddenly cause this error when everything was working fine before. The original (working) code is:

    # -*- coding: utf-8 -*-

    import csv 
    import pandas 
    import numpy as np 
    import sklearn.ensemble as ske 
    import re 
    import os 
    import collections 
    import pickle 
    from sklearn.externals import joblib 
    from sklearn import model_selection, tree, linear_model, svm 


    # Load dataset 
    url = 'test_6_During_100.csv' 
    dataset = pandas.read_csv(url) 
    dataset.set_index('Name', inplace = True) 
    ##dataset = dataset[['ProcessorAffinity','ProductVersion','Handle','Company', 
    ##   'UserProcessorTime','Path','Product','Description',]] 

    # Open file to output everything to 
    new_url = re.sub('\.csv$', '', url) 
    f = open(new_url + " output report", 'w') 
    f.write(new_url + " output report\n") 
    f.write("\n") 


    # shape 
    print(dataset.shape) 
    print("\n") 
    f.write("Dataset shape " + str(dataset.shape) + "\n") 
    f.write("\n") 

    clf = joblib.load(os.path.join(
      os.path.dirname(os.path.realpath(__file__)), 
      'classifier/classifier.pkl')) 


    Class_0 = [] 
    Class_1 = [] 
    prob = [] 

    for index, row in dataset.iterrows():
        res = clf.predict([row])
        if res == 0:
            if index in Class_0:
                Class_0.append(index)
            elif index in Class_1:
                Class_1.append(index)
            else:
                print "Is ", index, " recognised?"
                designation = raw_input()

                if designation == "No":
                    Class_0.append(index)
                else:
                    Class_1.append(index)

    dataset['Type'] = 1      
    dataset.loc[dataset.index.str.contains('|'.join(Class_0)), 'Type'] = 0 

    print "\n" 

    results = [] 

    results.append(collections.OrderedDict.fromkeys(dataset.index[dataset['Type'] == 0])) 
    print (results) 

    X = dataset.drop(['Type'], axis=1).values 
    Y = dataset['Type'].values 


    clf.set_params(n_estimators = len(clf.estimators_) + 40, warm_start = True) 
    clf.fit(X, Y) 
    joblib.dump(clf, 'classifier/classifier.pkl') 

    output = collections.Counter(Class_0) 

    print "Class_0; \n" 
    f.write ("Class_0; \n") 

    for key, value in output.items():
        f.write(str(key) + " ; " + str(value) + "\n")
        print(str(key) + " ; " + str(value))

    print "\n" 
    f.write ("\n") 

    output_1 = collections.Counter(Class_1) 

    print "Class_1; \n" 
    f.write ("Class_1; \n") 

    for key, value in output_1.items():
        f.write(str(key) + " ; " + str(value) + "\n")
        print(str(key) + " ; " + str(value))

    print "\n" 

    f.close() 

My new code is the same, but wrapped inside a couple of nested loops to keep the script running while there are files in the folder to process. The new code (the code causing the error) is below:

    # -*- coding: utf-8 -*-

    import csv
    import pandas
    import numpy as np
    import sklearn.ensemble as ske
    import re
    import os
    import time
    import collections
    import pickle
    from sklearn.externals import joblib
    from sklearn import model_selection, tree, linear_model, svm

    # Our arrays which we'll store our process details in and then later print out data for
    Class_0 = []
    Class_1 = []
    prob = []
    results = []

    # Open file to output our report to
    timestr = time.strftime("%Y%m%d%H%M%S")

    f = open(timestr + " output report.txt", 'w')
    f.write(timestr + " output report\n")
    f.write("\n")

    count = len(os.listdir('.'))

    while (count > 0):
        # Load dataset
        for filename in os.listdir('.'):
            if filename.endswith('.csv') and filename.startswith("processes_"):

                url = filename

                dataset = pandas.read_csv(url)
                dataset.set_index('Name', inplace = True)

                clf = joblib.load(os.path.join(
                    os.path.dirname(os.path.realpath(__file__)),
                    'classifier/classifier.pkl'))

                for index, row in dataset.iterrows():
                    res = clf.predict([row])
                    if res == 0:
                        if index in Class_0:
                            Class_0.append(index)
                        elif index in Class_1:
                            Class_1.append(index)
                        else:
                            print "Is ", index, " recognised?"
                            designation = raw_input()

                            if designation == "No":
                                Class_0.append(index)
                            else:
                                Class_1.append(index)

                dataset['Type'] = 1
                dataset.loc[dataset.index.str.contains('|'.join(Class_0)), 'Type'] = 0

                print "\n"

                results.append(collections.OrderedDict.fromkeys(dataset.index[dataset['Type'] == 0]))
                print (results)

                X = dataset.drop(['Type'], axis=1).values
                Y = dataset['Type'].values

                clf.set_params(n_estimators = len(clf.estimators_) + 40, warm_start = True)
                clf.fit(X, Y)
                joblib.dump(clf, 'classifier/classifier.pkl')

                os.remove(filename)

    output = collections.Counter(Class_0)

    print "Class_0; \n"
    f.write ("Class_0; \n")

    for key, value in output.items():
        f.write(str(key) + " ; " + str(value) + "\n")
        print(str(key) + " ; " + str(value))

    print "\n"
    f.write ("\n")

    output_1 = collections.Counter(Class_1)

    print "Class_1; \n"
    f.write ("Class_1; \n")

    for key, value in output_1.items():
        f.write(str(key) + " ; " + str(value) + "\n")
        print(str(key) + " ; " + str(value))

    print "\n"

    f.close()

The error (IndexError: index 1 is out of bounds for size 1) points at the prediction line res = clf.predict([row]). As far as I can tell, the problem is that there aren't enough "classes", or label types, in the data (I'm after a binary classifier)? But I've been using this exact method (outside the nested loops) without any problems.

https://codeshare.io/Gkpb44 - a codeshare link containing my .csv data for the .csv file mentioned above.

Answers

So I've realised what the problem was.

I load the classifier I created and then, using warm_start, re-fit the data to update the classifier, trying to emulate incremental/online learning. This works fine when I process data that contains both class types. However, if the data is only positive, then re-fitting the classifier breaks it.

For now I have commented out the following:

clf.set_params(n_estimators = len(clf.estimators_) + 40, warm_start = True) 
clf.fit(X, Y) 
joblib.dump(clf, 'classifier/classifier.pkl') 

which has solved the problem. Going forward, I'll probably add (yet another!) conditional statement to check whether I should re-fit the data.
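A conditional guard like the one described could be sketched as follows. This assumes a scikit-learn RandomForestClassifier; the helper name refit_if_both_classes and the synthetic data are my own, not from the original post:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def refit_if_both_classes(clf, X_new, Y_new):
    """Grow the forest only when the new batch contains both labels.

    Re-fitting with warm_start on a batch holding a single class resets
    clf.classes_ to that one class, after which predict() can fail with
    'IndexError: index 1 is out of bounds for size 1'.
    """
    if np.unique(Y_new).size < 2:
        return clf  # single-class batch: skip the re-fit entirely
    clf.set_params(n_estimators=len(clf.estimators_) + 40, warm_start=True)
    clf.fit(X_new, Y_new)
    return clf

# Tiny demonstration with synthetic two-feature data.
rng = np.random.RandomState(0)
X = rng.rand(20, 2)
Y = np.array([0] * 10 + [1] * 10)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, Y)

# A batch containing only class 1 is skipped, leaving the forest at 10 trees.
clf = refit_if_both_classes(clf, rng.rand(5, 2), np.ones(5, dtype=int))
print(len(clf.estimators_))  # still 10

# A mixed batch triggers the warm_start re-fit, adding 40 trees.
clf = refit_if_both_classes(clf, rng.rand(10, 2), np.array([0, 1] * 5))
print(len(clf.estimators_))  # 50
```

The check on np.unique keeps the incremental-learning loop intact while simply deferring the update until a batch with both labels arrives.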

I was tempted to delete this question, but since nothing I came across during my search mentioned this, I thought I'd leave it up with the answer in case anyone else finds they have the same problem.

Another answer:

The problem is that [row] is an array of length 1, and your program is trying to access index 1, which doesn't exist (indexing starts at 0). It looks like you may want to do res = clf.predict(row), or take another look at the row variable. Hope this helps.
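For context, scikit-learn's predict expects a 2-D input of shape (n_samples, n_features), so wrapping a single row as [row] is a valid one-sample input; the index in the reported error refers to the class axis, not the sample axis. A minimal sketch of the input-shape convention (the toy data and classifier choice here are my own):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Four samples, two features, two classes.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
Y = np.array([0, 1, 1, 0])
clf = DecisionTreeClassifier(random_state=0).fit(X, Y)

row = X[2]                 # one 1-D row of features, shape (2,)
res = clf.predict([row])   # wrapping it gives shape (1, 2): one sample
print(res.shape)           # (1,) - one prediction per sample
```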