
scikit-learn: is this classification timing correct?

Hi, I am classifying tweets into 7 classes. I have about 250,000 training tweets and another 250,000 test tweets. My code can be found below: training.pkl contains the training tweets, testing.pkl the test tweets, and the corresponding ground-truth labels are loaded alongside them.

When I run my code, I find that transforming the (raw) test set into the feature space takes 14.9649999142 seconds. I also measured how long it takes to classify all the tweets in the test set: 0.131999969482 seconds.

It seems implausible to me that the framework can classify 250,000 tweets in 0.131999969482 seconds. My question now is: is this correct?

file = open("training.pkl", 'rb') 
training = cPickle.load(file) 
file.close() 


file = open("testing.pkl", 'rb') 
testing = cPickle.load(file) 
file.close() 

file = open("ground_truth_testing.pkl", 'rb') 
ground_truth_testing = cPickle.load(file) 
file.close() 

file = open("ground_truth_training.pkl", 'rb') 
ground_truth_training = cPickle.load(file) 
file.close() 


print 'data loaded' 
tweetsTestArray = np.array(testing) 
tweetsTrainingArray = np.array(training) 
y_train = np.array(ground_truth_training) 


# Transform dataset to a design matrix with TFIDF and 1,2 gram 
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 2)) 

X_train = vectorizer.fit_transform(tweetsTrainingArray) 
print "n_samples: %d, n_features: %d" % X_train.shape 


print 'COUNT' 
_t0 = time.time() 
X_test = vectorizer.transform(tweetsTestArray) 
print "n_samples: %d, n_features: %d" % X_test.shape 
_t1 = time.time() 

print _t1 - _t0 
print 'STOP' 

# TRAINING & TESTING 

print 'SUPERVISED' 
print '----------------------------------------------------------' 
print 

print 'SGD' 

# Initialize Stochastic Gradient Descent classifier 
sgd = linear_model.SGDClassifier(loss='modified_huber', alpha=0.00003, n_iter=25) 

#Train 
sgd.fit(X_train, ground_truth_training) 

#Predict 

print "START COUNT" 
_t2 = time.time() 
target_sgd = sgd.predict(X_test) 
_t3 = time.time() 

print _t3 -_t2 
print "END COUNT" 

# Print report 
report_sgd = classification_report(ground_truth_testing, target_sgd) 
print report_sgd 
print 

Output of repr(X_train):

<248892x213162 sparse matrix of type '<type 'numpy.float64'>' 
    with 4346880 stored elements in Compressed Sparse Row format> 

Output of repr(X_test):

<249993x213162 sparse matrix of type '<type 'numpy.float64'>' 
    with 4205309 stored elements in Compressed Sparse Row format> 

Answers

What are the shape and the number of non-zero features of the extracted X_train and X_test sparse matrices? Do they relate approximately to the number of words in your corpus?

Classification is expected to be much faster than feature extraction for a linear model: it only computes dot products, so the cost is directly linear in the number of non-zeros (i.e., approximately the number of words in the test set).
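
To make this concrete, here is a minimal sketch of my own (not part of the original answer) of what the prediction step boils down to, reusing the X_test and sgd variables from the question's code: a sparse-times-dense matrix product over the stored non-zero entries, followed by an argmax over the per-class scores.

import numpy as np 

# decision scores: coef_ has shape (n_classes, n_features) and 
# intercept_ has shape (n_classes,); the product only touches the 
# stored non-zero entries of the sparse test matrix 
scores = X_test.dot(sgd.coef_.T) + sgd.intercept_ 

# pick the class with the highest score in each row, which is what 
# predict() does for a multi-class linear model 
predicted = sgd.classes_[np.argmax(scores, axis=1)] 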

EDIT: to get statistics on the X_train and X_test sparse matrices, just do:

>>> print repr(X_train) 
>>> print repr(X_test) 
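
The same figures can also be read directly from the scipy sparse matrix attributes (a small sketch along the same lines; shape and nnz are standard scipy.sparse attributes):

>>> print X_train.shape, X_train.nnz 
>>> print X_test.shape, X_test.nnz 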

EDIT 2: your numbers look fine. Prediction with a linear model on numerical features is indeed much faster than feature extraction:

>>> from sklearn.datasets import fetch_20newsgroups 
>>> from sklearn.feature_extraction.text import TfidfVectorizer 
>>> twenty = fetch_20newsgroups() 
>>> %time X = TfidfVectorizer().fit_transform(twenty.data) 
CPU times: user 10.74 s, sys: 0.32 s, total: 11.06 s 
Wall time: 11.04 s 

>>> X 
<11314x56436 sparse matrix of type '<type 'numpy.float64'>' 
    with 1713894 stored elements in Compressed Sparse Row format> 
>>> from sklearn.linear_model import SGDClassifier 

>>> %time clf = SGDClassifier().fit(X, twenty.target) 
CPU times: user 0.50 s, sys: 0.01 s, total: 0.51 s 
Wall time: 0.51 s 

>>> %time clf.predict(X) 
CPU times: user 0.10 s, sys: 0.00 s, total: 0.11 s 
Wall time: 0.11 s 
array([7, 4, 4, ..., 3, 1, 8]) 
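
As a further sanity check, here is a self-contained sketch of my own (not from the original answer) showing that the predict cost grows roughly with the number of stored non-zeros in the test matrix rather than with its nominal shape; the synthetic matrices, the density of 2e-4 and the 7 synthetic classes are arbitrary choices for illustration, and absolute times will of course differ per machine.

import time 

import numpy as np 
import scipy.sparse as sp 
from sklearn.linear_model import SGDClassifier 

n_features = 200000 

# small random training set, only used to obtain a fitted 7-class model 
X = sp.rand(5000, n_features, density=2e-4, format='csr') 
y = np.random.randint(0, 7, size=5000) 
clf = SGDClassifier(loss='modified_huber').fit(X, y) 

# larger and larger random test sets with the same density: the measured 
# predict time should scale with the reported nnz (number of non-zeros) 
for n_rows in (50000, 100000, 200000): 
    X_new = sp.rand(n_rows, n_features, density=2e-4, format='csr') 
    t0 = time.time() 
    clf.predict(X_new) 
    print("rows=%d nnz=%d predict=%.3fs" % (n_rows, X_new.nnz, time.time() - t0)) 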

They are both sparse matrices. I have added the output of printing X_train to the question. So you think this is normal and possible? – Ojtwist

Indeed, I made a mistake: to print those statistics you need the repr builtin. I have edited my answer again. – ogrisel

Added the output – Ojtwist