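Classification with n-grams

I want to use an sklearn classifier with n-gram features. I also want to use cross-validation to find the best n-gram order. However, I am a bit stuck on how to put everything together.

Right now, I have the following code: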

import pandas as pd 
import numpy as np 
from sklearn.model_selection import train_test_split 
from sklearn.model_selection import cross_val_score 
from sklearn.model_selection import KFold 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.naive_bayes import MultinomialNB 

text = ... # This is the input text. A list of strings 
labels = ... # These are the labels of each sentence 
# Find the optimal order of the ngrams by cross-validation 
scores = pd.Series(index=range(1,6), dtype=float) 
folds = KFold(n_splits=3) 

for n in range(1,6): 
    count_vect = CountVectorizer(ngram_range=(n,n), stop_words='english') 
    X = count_vect.fit_transform(text) 
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.33, random_state=42) 
    clf = MultinomialNB() 
    score = cross_val_score(clf, X_train, y_train, cv=folds, n_jobs=-1) 
    scores.loc[n] = np.mean(score) 

# Evaluate the classifier using the best order found 
order = scores.idxmax() 
count_vect = CountVectorizer(ngram_range=(order,order), stop_words='english') 
X = count_vect.fit_transform(text) 
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.33, random_state=42) 
clf = MultinomialNB() 
clf = clf.fit(X_train, y_train) 
acc = clf.score(X_test, y_test) 
print('Accuracy is {}'.format(acc))

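However, I feel this is the wrong way to do it, because I create a train/test split in every iteration of the loop.

If I do the train/test split up front and apply CountVectorizer to the two parts separately, the resulting matrices have different shapes, which causes problems when calling clf.fit and clf.score.

How can I solve this?

Edit: If I try to build a vocabulary first, I still have to build several vocabularies, since the vocabulary for unigrams differs from the one for bigrams, and so on.

For example: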

import nltk

# unigram vocab (assumes each sentence is already tokenized into a list of words)
vocab = set()
for sentence in text:
    for word in sentence:
        vocab.add(word)
len(vocab) # 47291

# bigram vocab
vocab = set()
for sentence in text:
    bigrams = nltk.ngrams(sentence, 2)
    for bigram in bigrams:
        vocab.add(bigram)
len(vocab) # 326044

This again leads to the same problem: I need to apply CountVectorizer separately for each n-gram size.
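For what it is worth, CountVectorizer can build these per-order vocabularies itself through its ngram_range parameter, so the manual loops may not be needed. A minimal sketch on two made-up sentences (the data is an assumption for illustration, not taken from the question):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentences, for illustration only.
docs = ["the cat sat", "the cat ran"]

sizes = {}
for ngram_range in [(1, 1), (2, 2), (1, 2)]:
    count_vect = CountVectorizer(ngram_range=ngram_range)
    count_vect.fit(docs)
    # vocabulary_ maps each n-gram string to its column index.
    sizes[ngram_range] = len(count_vect.vocabulary_)

print(sizes)
```

The unigram and bigram vocabularies differ in size, and ngram_range=(1, 2) holds both in a single vocabulary.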

2017-06-02 JNevens

Build the vocabulary first, from the training set. Nothing prevents you from putting both unigrams and bigrams (and more) in the same dictionary. – alexis
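One way to read this suggestion, sketched on toy data: split the raw strings first, fit CountVectorizer on the training part only, and reuse the fitted vocabulary for the test part via transform. The sentences and labels below are invented for illustration; stop_words is left out only so that the tiny documents do not end up empty.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy data, purely illustrative (not from the question).
text = [
    "good movie", "great film", "good great plot", "really good acting",
    "bad movie", "awful film", "bad awful plot", "really bad acting",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw strings, so test documents never influence the vocabulary.
text_train, text_test, y_train, y_test = train_test_split(
    text, labels, test_size=0.25, random_state=42, stratify=labels
)

# Unigrams and bigrams share one vocabulary, learned from the training part.
count_vect = CountVectorizer(ngram_range=(1, 2))
X_train = count_vect.fit_transform(text_train)
X_test = count_vect.transform(text_test)  # reuses the fitted vocabulary

clf = MultinomialNB().fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print(X_train.shape, X_test.shape)
```

Because transform only counts n-grams already present in the fitted vocabulary, both matrices have the same number of columns, and clf.fit and clf.score work without a shape mismatch.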

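You need to set the vocabulary parameter first. One way or another you have to provide the whole vocabulary, otherwise the dimensions will never match (obviously). If you do the train/test split first, there may be words that do not occur in one of the two sets, and you end up with a dimension mismatch.

The documentation says:

If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analyzing the data.

Further down, you will find the description of vocabulary.

vocabulary:
Mapping or iterable, optional. Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents. Indices in the mapping should not be repeated and should not have any gap between 0 and the largest index.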
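A minimal sketch of the vocabulary parameter in use. The vocabulary and documents are hypothetical; note that ngram_range has to cover the n-gram sizes that appear in the supplied vocabulary, otherwise the bigram entry would never be counted.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical fixed vocabulary mixing unigrams and one bigram.
vocab = ["good", "bad", "movie", "good movie"]

train_docs = ["good movie", "bad movie"]
test_docs = ["good film"]  # "film" is not in vocab and is ignored

# With an iterable vocabulary, column indices follow the order of the list.
count_vect = CountVectorizer(vocabulary=vocab, ngram_range=(1, 2))

X_train = count_vect.fit_transform(train_docs)
X_test = count_vect.transform(test_docs)

print(X_train.toarray())
print(X_test.toarray())
```

Both matrices have exactly len(vocab) columns regardless of which terms occur in each part.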

2017-06-02 17:21:22 displayname

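OK, so I would do the following. I collect the list of all words in 'text'; this is 'vocab'. Then I can do the train/test split using 'text' and 'labels'. After that, I can run 'CountVectorizer' on these separate parts, with the 'vocabulary' parameter set to 'vocab'. Correct? – JNevens

@JNevens Yes, that should work. In the end, each feature vector is n-dimensional, where n is the number of words in the whole corpus. Your model will be trained on n-dimensional vectors, which means you cannot change the number of dimensions afterwards - how should your model classify an m-dimensional vector? – displayname

As stated in my question, I want to try the classifier with different n-gram orders. So if I use 'CountVectorizer' with 1-grams or with 2-grams, the vocabulary sizes differ again, because in the former case the vocabulary consists of all words, while in the latter it consists of all bigrams. – JNevens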
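One possible way around the per-order vocabulary problem, offered as a sketch and not as what the commenters proposed: wrap CountVectorizer and the classifier in a Pipeline and let GridSearchCV search over ngram_range. The vectorizer is then refit, with its own vocabulary, for every order and every fold. The step names "vect" and "clf" and the toy sentences are assumptions made for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, purely illustrative; every sentence has at least three tokens
# so that trigram vocabularies are never empty.
positive = [
    "this movie was really good", "the film was really great",
    "really good acting and plot", "great plot and good acting",
    "the acting was very good", "this film has great acting",
    "very good movie overall", "great movie with good plot",
]
negative = [
    "this movie was really bad", "the film was really awful",
    "really bad acting and plot", "awful plot and bad acting",
    "the acting was very bad", "this film has awful acting",
    "very bad movie overall", "awful movie with bad plot",
]
text = positive + negative
labels = [1] * len(positive) + [0] * len(negative)

# A single split, done once on the raw strings.
text_train, text_test, y_train, y_test = train_test_split(
    text, labels, test_size=0.25, random_state=42, stratify=labels
)

pipe = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])

# Candidate n-gram orders: unigrams only, bigrams only, trigrams only.
param_grid = {"vect__ngram_range": [(n, n) for n in range(1, 4)]}

search = GridSearchCV(pipe, param_grid, cv=StratifiedKFold(n_splits=3),
                      scoring="accuracy")
search.fit(text_train, y_train)

# By default the best pipeline is refit on the whole training part.
test_acc = search.score(text_test, y_test)
print(search.best_params_, test_acc)
```

Since the vectorizer lives inside the pipeline, each cross-validation fold learns its vocabulary from that fold's training portion only, and the held-out test strings are touched once, at the end.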
