How can I accurately classify text with a large number of potential values using scikit-learn?

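I have a variety of blacklisted terms that I want to identify in a corpus of text paragraphs. Each term is about 1 to 5 words long and contains certain keywords that I do not want in my document collection. If a term, or something similar to it, is identified in the corpus, I want it removed from the corpus.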

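Removal aside, I am struggling to accurately identify these terms in my corpus. I am using scikit-learn and have tried two different approaches:

A MultinomialNB classification approach using TF-IDF vector features, with a mix of blacklisted terms and clean terms as training data.
A OneClassSVM approach, where only the blacklisted keywords are used as training data and any incoming text that does not look like the blacklisted terms is treated as an outlier.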
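
For comparison with the OneClassSVM code further down, the first approach might look roughly like the sketch below. The training terms and labels are hypothetical toy data, and the character n-gram settings are simply borrowed from the OneClassSVM pipeline, so this is an illustration of the idea rather than the configuration actually used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training data: 1 = blacklisted, 0 = clean.
train_terms = ["sexy girls", "hot sexy girls", "sexy girl pics",
               "this is a term", "lorem ipsum", "weather forecast today"]
train_labels = [1, 1, 1, 0, 0, 0]

nb_pipeline = Pipeline([
    # character n-grams make misspellings share most features with the original term
    ('tfidf', TfidfVectorizer(analyzer='char_wb', ngram_range=(1, 5))),
    # two-class Naive Bayes trained on both blacklisted and clean terms
    ('clf', MultinomialNB()),
])
nb_pipeline.fit(train_terms, train_labels)

# Classify unseen terms; each prediction is 1 (blacklisted) or 0 (clean).
predictions = nb_pipeline.predict(["sexxy girls", "lorem ipsum dolor"])
print(predictions)
```

Unlike the one-class setup, this variant needs a representative sample of clean terms, and its quality depends heavily on how well that sample covers the real corpus.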

Here is the code for my OneClassSVM approach:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.svm import OneClassSVM

df = pd.read_csv("keyword_training_blacklist.csv")
keywords_list = df['Keyword'].to_numpy()  # plain array so positional fold indices work

pipeline = Pipeline([
    # strings to character n-gram counts (1 to 5 characters, within word boundaries)
    ('vect', CountVectorizer(analyzer='char_wb', max_df=0.75, min_df=1, ngram_range=(1, 5))),
    # counts to L2-normalized term frequencies (IDF weighting is disabled)
    ('tfidf', TfidfTransformer(use_idf=False, norm='l2')),
    # one-class SVM fitted on the blacklisted terms only
    ('clf', OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)),
])

kf = KFold(n_splits=8)
for train_index, test_index in kf.split(keywords_list):
    # make training and testing datasets
    X_train, X_test = keywords_list[train_index], keywords_list[test_index]

    pipeline.fit(X_train)  # train on blacklisted terms only (no labels)
    predicted = pipeline.predict(X_test)
    # fraction of held-out blacklisted terms recognized as inliers (+1)
    print(predicted[predicted == 1].size / predicted.size)

csv_df = pd.read_csv("corpus.csv")
testCorpus = csv_df['Terms'].drop_duplicates()

# print every corpus term the model considers similar to the blacklist
for s in testCorpus:
    if pipeline.predict([s])[0] == 1:
        print(s)

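In practice, I get many false positives when I run my corpus through the algorithm. My blacklisted training data consists of about 3,000 terms. Does my training set need to be larger, or am I missing something obvious?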

2016-03-10 GreenGodot

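What are your actual features? Just single words? Have you tried using adjacent word pairs? Also, what do you mean by "a term or something similar": semantically similar, within a certain edit distance, or something else? – tripleee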

Are you trying to remove the documents that contain these terms, or the terms themselves? Why not use regular expressions? –

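I want to catch terms that are spelled similarly to the terms in my blacklist. A term is a simple string, for example "this is a term" or "lorem ipsum". A blacklisted term would be "sexy girls", and I would also want to catch similar entries such as "sexxy girls". I looked into methods such as Levenshtein distance, but I am not sure whether they can be incorporated into an ML algorithm. The regex approach sounds obvious at first, but I have thousands of blacklisted terms and millions of corpus terms, which is why I need an ML approach. – GreenGodot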
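
For reference, the Levenshtein distance mentioned in the comment above can be computed with a short dynamic-programming routine. The sketch below is a plain-Python illustration; the helper `near_blacklist` and the threshold `max_distance` are hypothetical names introduced here, and comparing every corpus term against every blacklisted term this way would be slow at the scale described.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def near_blacklist(term, blacklist, max_distance=1):
    """Hypothetical helper: blacklisted terms within max_distance edits of term."""
    return [b for b in blacklist if levenshtein(term, b) <= max_distance]


print(near_blacklist("sexxy girls", ["sexy girls", "lorem ipsum"]))
```

A distance like this could be used as a pre-filter or as an extra numeric feature, but whether that helps on the real data would need to be measured.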

Try using difflib to find the closest matches in the corpus for each blacklisted term.

import difflib
from nltk.util import ngrams

# `corpus` (a single string) and `terms` (the list of blacklisted terms)
# are assumed to be defined already.
words = corpus.split(' ')  # split corpus into words on spaces (can be improved)

words_ngrams = []  # word n-grams from 1 to 5 words, joined back into strings
for n in range(1, 6):
    words_ngrams.extend(' '.join(gram) for gram in ngrams(words, n))

to_delete = []  # (index, length) tuples of matched spans to delete from the corpus
sim_rate = 0.8  # minimum similarity ratio for a match
max_matches = 4  # maximum number of matches for each term
for term in terms:
    matches = difflib.get_close_matches(term, words_ngrams, n=max_matches, cutoff=sim_rate)
    for match in matches:
        # corpus.index returns the first occurrence of the matched text
        to_delete.append((corpus.index(match), len(match)))

You can also use difflib.SequenceMatcher if you want a similarity score between the terms and the n-grams.
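
A minimal sketch of that scoring, assuming a small hypothetical blacklist; `similarity` and `best_match` are illustrative helper names, not part of difflib:

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] as computed by difflib.SequenceMatcher."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def best_match(candidate, blacklist):
    """Return (score, blacklisted_term) for the most similar blacklisted term."""
    return max((similarity(candidate, term), term) for term in blacklist)


blacklist = ["sexy girls", "lorem ipsum"]  # hypothetical blacklist
score, term = best_match("sexxy girls", blacklist)
print(term, round(score, 3))
```

Having the score itself lets you tune the cutoff on known examples instead of relying on a fixed 0.8.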

2016-03-21 09:10:34
