2016-02-19 21 views

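Using scikit-learn to distinguish similar categories

I want to classify the text in documents into different categories. Each document can belong to only one of the following categories: PR, AR, KID, SAR.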

I found an example that uses scikit-learn and was able to get it working:

import numpy 
from sklearn.pipeline import Pipeline 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.svm import LinearSVC 
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.multiclass import OneVsRestClassifier 
from pandas import DataFrame 

def build_data_frame(path, classification):
    """Read one text file into a single-row DataFrame indexed by its path."""
    # The context manager closes the file even if reading fails.
    with open(path, mode='r', encoding='utf8') as f:
        txt = f.read()

    rows = [{'text': txt, 'class': classification}]
    return DataFrame(rows, index=[path])

# Categories 
PR = 'PR' 
AR = 'AR' 
KID = 'KID' 
SAR = 'SAR' 

# Training documents 
SOURCES = [ 
    (r'C:/temp_training/PR/PR1.txt', PR), 
    (r'C:/temp_training/PR/PR2.txt', PR), 
    (r'C:/temp_training/PR/PR3.txt', PR), 
    (r'C:/temp_training/PR/PR4.txt', PR), 
    (r'C:/temp_training/PR/PR5.txt', PR), 
    (r'C:/temp_training/AR/AR1.txt', AR), 
    (r'C:/temp_training/AR/AR2.txt', AR), 
    (r'C:/temp_training/AR/AR3.txt', AR), 
    (r'C:/temp_training/AR/AR4.txt', AR), 
    (r'C:/temp_training/AR/AR5.txt', AR), 
    (r'C:/temp_training/KID/KID1.txt', KID), 
    (r'C:/temp_training/KID/KID2.txt', KID), 
    (r'C:/temp_training/KID/KID3.txt', KID), 
    (r'C:/temp_training/KID/KID4.txt', KID), 
    (r'C:/temp_training/KID/KID5.txt', KID), 
    (r'C:/temp_training/SAR/SAR1.txt', SAR), 
    (r'C:/temp_training/SAR/SAR2.txt', SAR), 
    (r'C:/temp_training/SAR/SAR3.txt', SAR), 
    (r'C:/temp_training/SAR/SAR4.txt', SAR), 
    (r'C:/temp_training/SAR/SAR5.txt', SAR) 
] 

# Real documents 
TESTS = [ 
    (r'C:/temp_testing/PR/PR1.txt'), 
    (r'C:/temp_testing/PR/PR2.txt'), 
    (r'C:/temp_testing/PR/PR3.txt'), 
    (r'C:/temp_testing/PR/PR4.txt'), 
    (r'C:/temp_testing/PR/PR5.txt'), 
    (r'C:/temp_testing/AR/AR1.txt'), 
    (r'C:/temp_testing/AR/AR2.txt'), 
    (r'C:/temp_testing/AR/AR3.txt'), 
    (r'C:/temp_testing/AR/AR4.txt'), 
    (r'C:/temp_testing/AR/AR5.txt'), 
    (r'C:/temp_testing/KID/KID1.txt'), 
    (r'C:/temp_testing/KID/KID2.txt'), 
    (r'C:/temp_testing/KID/KID3.txt'), 
    (r'C:/temp_testing/KID/KID4.txt'), 
    (r'C:/temp_testing/KID/KID5.txt'), 
    (r'C:/temp_testing/SAR/SAR1.txt'), 
    (r'C:/temp_testing/SAR/SAR2.txt'), 
    (r'C:/temp_testing/SAR/SAR3.txt'), 
    (r'C:/temp_testing/SAR/SAR4.txt'), 
    (r'C:/temp_testing/SAR/SAR5.txt') 
] 

# DataFrame.append was removed in pandas 2.0, so collect the single-row
# frames and concatenate them in one call.
from pandas import concat

data_train = concat([build_data_frame(path, classification)
                     for path, classification in SOURCES])

data_train = data_train.reindex(numpy.random.permutation(data_train.index)) 

examples = []

for path in TESTS:
    # Close each file as soon as it has been read.
    with open(path, mode='r', encoding='utf8') as f:
        examples.append(f.read())

target_names = [PR, AR, KID, SAR] 

classifier = Pipeline([ 
    ('vectorizer', CountVectorizer(ngram_range=(1, 2), analyzer='word', strip_accents='unicode', stop_words='english')), 
    ('tfidf', TfidfTransformer()), 
    ('clf', OneVsRestClassifier(LinearSVC()))]) 
classifier.fit(data_train['text'], data_train['class']) 
predicted = classifier.predict(examples) 

print(predicted) 

Output:

['PR' 'PR' 'PR' 'PR' 'PR' 'AR' 'AR' 'AR' 'AR' 'AR' 'KID' 'KID' 'KID' 'KID' 
'KID' 'AR' 'AR' 'AR' 'SAR' 'AR'] 

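PR, AR and KID are recognized perfectly.

However, the SAR documents (the last 5) are not classified correctly, except for one of them. SAR and AR are very similar, which could explain why the algorithm gets confused.

I tried playing with the n-gram values, but 1 (min) and 2 (max) seem to give the best results.

  • Any ideas on how to improve the precision in distinguishing the AR and SAR categories?

  • Is there a way to display the recognition percentage for a specific document? E.g. PR (70%), meaning the algorithm is 70% confident in its prediction.

If you need the documents, here is the set: http://1drv.ms/21dnL6j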

Answer


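This is not strictly a programming question, so I suggest you try posting it on a more data-science-oriented Stack Exchange site.

Anyway, here are some things you can try: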

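  • Try some other classifiers.
  • Use a grid search to tune the classifier's hyperparameters.
  • Use OneVsOne instead of OneVsAll (OneVsRestClassifier in your code) as the strategy. This might help you separate SAR from AR.
  • For "displaying the recognition percentage for a specific document", you can use the probability output that some models provide, calling classifier.predict_proba instead of classifier.predict. Note that LinearSVC does not implement predict_proba, so your pipeline needs either a classifier that does (such as LogisticRegression) or a calibration wrapper such as CalibratedClassifierCV.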
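
As a rough illustration of how three of these suggestions could fit together, here is a minimal sketch, not a tuned solution. It assumes LogisticRegression as the "other classifier" (it exposes predict_proba directly), uses TfidfVectorizer as a shorthand for CountVectorizer followed by TfidfTransformer, and searches a small, arbitrarily chosen parameter grid. The helper names fit_with_grid_search and confidence_report are hypothetical, introduced only for this example, and the OneVsOne idea is not covered here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def fit_with_grid_search(texts, labels, cv=3):
    """Fit a TF-IDF + logistic regression pipeline, tuned by grid search.

    Hypothetical helper: the grid values below are illustrative guesses,
    not recommendations for the AR/SAR data.
    """
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(strip_accents='unicode')),
        # LogisticRegression is an assumption; any classifier that
        # implements predict_proba could be used here instead.
        ('clf', LogisticRegression(max_iter=1000)),
    ])
    param_grid = {
        'tfidf__ngram_range': [(1, 1), (1, 2)],
        'clf__C': [0.1, 1.0, 10.0],
    }
    # Each class needs at least `cv` training documents for stratified folds.
    search = GridSearchCV(pipeline, param_grid, cv=cv)
    search.fit(texts, labels)
    return search


def confidence_report(model, documents):
    """Return (predicted_label, probability) for each document."""
    probabilities = model.predict_proba(documents)
    report = []
    for row in probabilities:
        best = row.argmax()  # column index of the most probable class
        report.append((str(model.classes_[best]), float(row[best])))
    return report
```

With the variables from the question, this would be called as search = fit_with_grid_search(data_train['text'], data_train['class']) followed by confidence_report(search, examples). The probabilities come from the model itself, so with only five training documents per class they should be read as a rough ranking rather than a reliable confidence.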

Good luck!
