2016-04-28

How to merge NaiveBayesClassifier objects in NLTK

I am working on a project using the NLTK toolkit. With the hardware I have, I can only run the classifier on small data sets. So I split the data into smaller chunks, ran the classifier on each one, and stored each of the resulting classifier objects in a pickle file.

Now, for testing, I need to combine all of these objects into one to get better results. So my question is: how do I merge these objects into one?

objs = []

while True:
    try:
        f = open(picklename, "rb")
        objs.extend(pickle.load(f))
        f.close()
    except EOFError:
        break

Doing it this way does not work. It gives the error TypeError: 'NaiveBayesClassifier' object is not iterable.
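(A quick way to see where that error comes from: pickle.load returns exactly one object per call, and list.extend tries to iterate over its argument. A minimal sketch, with a plain int standing in for the pickled classifier:)

```python
import pickle

# pickle.loads() returns ONE object per call -- here an int stands in
# for the pickled classifier object.
data = pickle.dumps(42)

objs = []
objs.append(pickle.loads(data))      # works: stores the single object
print(objs)                          # [42]

try:
    objs.extend(pickle.loads(data))  # extend() iterates over its argument
except TypeError as err:
    print(err)                       # 'int' object is not iterable
```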

The NaiveBayesClassifier code:

classifier = nltk.NaiveBayesClassifier.train(training_set) 
What does the code for 'NaiveBayesClassifier' look like? – Omid

@Omid It is a toolkit. I edited my question to show the classifier. – Arkham

Answer

I don't know the exact format of your data, but you can't simply merge the classifiers. A Naive Bayes classifier stores probability distributions computed from the training data, and there is no way to merge probability distributions without access to the original counts.

If you look at the source code here: http://www.nltk.org/_modules/nltk/classify/naivebayes.html the classifier stores these instance variables:

self._label_probdist = label_probdist 
self._feature_probdist = feature_probdist 

These are computed in the train method using relative frequency counts, e.g. P(L_1) = (# of L_1 in the training set) / (# of labels in the training set). To combine two classifiers you would instead need P(L_1) = (# of L_1 in train1 + # of L_1 in train2) / (# of labels in train1 + train2).
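As a numeric sketch (with made-up counts): averaging the per-chunk probabilities gives the wrong answer whenever the chunks have different sizes; only the raw counts combine correctly.

```python
# Hypothetical label counts for label L1 in two training chunks.
c1, n1 = 30, 100   # L1 appeared 30 times among 100 labeled examples
c2, n2 = 30, 300   # L1 appeared 30 times among 300 labeled examples

p1 = c1 / n1                        # 0.3
p2 = c2 / n2                        # 0.1
averaged = (p1 + p2) / 2            # 0.2  -- wrong
combined = (c1 + c2) / (n1 + n2)    # 0.15 -- correct, needs the raw counts
print(averaged, combined)
```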

However, the naive Bayes procedure is not very hard to implement from scratch, especially if you follow the train source code in the link above. Here is an outline, using that NaiveBayes source code:

  1. Store 'FreqDist' objects for the labels and features of each subset of the data.

    from collections import defaultdict
    from nltk.probability import FreqDist

    label_freqdist = FreqDist()
    feature_freqdist = defaultdict(FreqDist)
    feature_values = defaultdict(set)
    fnames = set()

    # Count up how many times each feature value occurred, given
    # the label and featurename.
    for featureset, label in labeled_featuresets:
        label_freqdist[label] += 1
        for fname, fval in featureset.items():
            # Increment freq(fval|label, fname)
            feature_freqdist[label, fname][fval] += 1
            # Record that fname can take the value fval.
            feature_values[fname].add(fval)
            # Keep a list of all feature names.
            fnames.add(fname)

    # If a feature didn't have a value given for an instance, then
    # we assume that it gets the implicit value 'None.' This loop
    # counts up the number of 'missing' feature values for each
    # (label,fname) pair, and increments the count of the fval
    # 'None' by that amount.
    for label in label_freqdist:
        num_samples = label_freqdist[label]
        for fname in fnames:
            count = feature_freqdist[label, fname].N()
            # Only add a None key when necessary, i.e. if there are
            # any samples with feature 'fname' missing.
            if num_samples - count > 0:
                feature_freqdist[label, fname][None] += num_samples - count
                feature_values[fname].add(None)

    # Use pickle to store label_freqdist, feature_freqdist, feature_values
    
  2. Combine those using their built-in addition method. This will allow you to get the relative frequencies over all of the data.

    all_label_freqdist = FreqDist()
    all_feature_freqdist = defaultdict(FreqDist)
    all_feature_values = defaultdict(set)

    for file in train_labels:
        f = open(file, "rb")
        all_label_freqdist += pickle.load(f)
        f.close()

    # Combine the default dicts for features similarly
    
  3. Use an 'estimator' to create the probability distributions.

    # Pass the estimator class itself (it is called below with the
    # freqdist, as in NaiveBayesClassifier.train), not an instance.
    estimator = ELEProbDist

    label_probdist = estimator(all_label_freqdist)

    # Create the P(fval|label, fname) distribution
    feature_probdist = {}
    for ((label, fname), freqdist) in all_feature_freqdist.items():
        probdist = estimator(freqdist, bins=len(all_feature_values[fname]))
        feature_probdist[label, fname] = probdist

    classifier = NaiveBayesClassifier(label_probdist, feature_probdist)
    

The classifier will then combine the counts over all of your data and produce what you need.
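The "combine the default dicts for features similarly" comment in step 2 is left as an exercise above; here is a minimal sketch of that merge. It uses collections.Counter as a stand-in for nltk's FreqDist (which subclasses Counter), so the same += logic applies to real FreqDist objects; the chunk data below is made up for illustration.

```python
from collections import Counter, defaultdict

def merge_feature_counts(chunks):
    """Merge per-chunk (label, fname) -> value-count tables into one.

    Each chunk is a (feature_freqdist, feature_values) pair as built
    in step 1. Counter stands in for nltk FreqDist here.
    """
    all_feature_freqdist = defaultdict(Counter)
    all_feature_values = defaultdict(set)
    for feature_freqdist, feature_values in chunks:
        for key, freqdist in feature_freqdist.items():
            all_feature_freqdist[key] += freqdist   # add the counts
        for fname, values in feature_values.items():
            all_feature_values[fname] |= values     # union the value sets
    return all_feature_freqdist, all_feature_values

# Two hypothetical chunks counting values of feature 'word' for label 'pos'.
chunk1 = ({("pos", "word"): Counter({"good": 2})}, {"word": {"good"}})
chunk2 = ({("pos", "word"): Counter({"good": 1, "bad": 3})},
          {"word": {"good", "bad"}})
merged, values = merge_feature_counts([chunk1, chunk2])
print(merged[("pos", "word")]["good"])  # 3
```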