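scikit-learn: Is the classification timing correct?

Hi, I am classifying tweets into 7 classes. I have about 250,000 training tweets and another 250,000 test tweets. My code can be found below. training.pkl contains the training tweets and testing.pkl contains the test tweets. The corresponding labels are loaded from the ground_truth_*.pkl files.
When I run my code, I see that transforming the (raw) test set into the feature space takes 14.9649999142 seconds. I also measured how long it takes to classify all tweets in the test set, which is 0.131999969482 seconds.
It seems very unlikely to me that the framework can classify roughly 250,000 tweets in 0.131999969482 seconds. So my question is: is this correct?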
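import cPickle
import time

import numpy as np
from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report

# Load the pickled tweets and their labels
file = open("training.pkl", 'rb')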
training = cPickle.load(file)
file.close()
file = open("testing.pkl", 'rb')
testing = cPickle.load(file)
file.close()
file = open("ground_truth_testing.pkl", 'rb')
ground_truth_testing = cPickle.load(file)
file.close()
file = open("ground_truth_training.pkl", 'rb')
ground_truth_training = cPickle.load(file)
file.close()
print 'data loaded'
tweetsTestArray = np.array(testing)
tweetsTrainingArray = np.array(training)
y_train = np.array(ground_truth_training)
# Transform the dataset into a design matrix using TF-IDF over unigrams and bigrams
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, ngram_range=(1, 2))
X_train = vectorizer.fit_transform(tweetsTrainingArray)
print "n_samples: %d, n_features: %d" % X_train.shape
print 'COUNT'
_t0 = time.time()
X_test = vectorizer.transform(tweetsTestArray)
print "n_samples: %d, n_features: %d" % X_test.shape
_t1 = time.time()
print _t1 - _t0
print 'STOP'
# TRAINING & TESTING
print 'SUPERVISED'
print '----------------------------------------------------------'
print
print 'SGD'
# Initialize the Stochastic Gradient Descent classifier
sgd = linear_model.SGDClassifier(loss='modified_huber', alpha=0.00003, n_iter=25)
# Train
sgd.fit(X_train, y_train)
#Predict
print "START COUNT"
_t2 = time.time()
target_sgd = sgd.predict(X_test)
_t3 = time.time()
print _t3 -_t2
print "END COUNT"
# Print report
report_sgd = classification_report(ground_truth_testing, target_sgd)
print report_sgd
print
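For context on why prediction can be so much cheaper than vectorization: for a linear model such as SGDClassifier, predicting amounts to a sparse dot product between each sample's nonzero features and one weight vector per class, followed by picking the highest score. Below is a minimal pure-Python sketch of that idea. The helper name predict_linear and all toy weights and rows are hypothetical, and this is not scikit-learn's actual implementation, which does the equivalent arithmetic with optimized sparse matrix routines.

```python
def predict_linear(rows, weights, intercepts):
    """Return the index of the best-scoring class for each sparse row.

    rows       -- list of samples, each a list of (feature_index, value) pairs
    weights    -- one dense weight list per class
    intercepts -- one bias term per class
    """
    predictions = []
    for row in rows:
        scores = []
        for w, b in zip(weights, intercepts):
            # Only the nonzero features of the sample contribute to the score.
            scores.append(b + sum(value * w[idx] for idx, value in row))
        # Pick the class whose score is largest.
        predictions.append(max(range(len(scores)), key=scores.__getitem__))
    return predictions


# Toy data: 3 classes, 4 features, 2 samples (all values are hypothetical).
toy_weights = [
    [1.0, 0.0, -1.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 3.0],
]
toy_intercepts = [0.0, 0.0, 0.0]
toy_rows = [
    [(0, 0.5), (1, 0.5)],  # features 0 and 1 are nonzero
    [(2, 1.0), (3, 0.2)],  # features 2 and 3 are nonzero
]
toy_predictions = predict_linear(toy_rows, toy_weights, toy_intercepts)
print(toy_predictions)  # [1, 2]
```

The cost per sample depends only on how many nonzero features it has, not on the full vocabulary size, which is why a very wide but very sparse matrix can still be scored quickly.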
Printing X_train:
<248892x213162 sparse matrix of type '<type 'numpy.float64'>'
with 4346880 stored elements in Compressed Sparse Row format>
Printing X_test:
<249993x213162 sparse matrix of type '<type 'numpy.float64'>'
with 4205309 stored elements in Compressed Sparse Row format>
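As a rough sanity check on these numbers, here is a back-of-the-envelope sketch. It assumes that the 249993-row matrix above is the one passed to predict, that the classifier keeps one weight vector per class (7 classes, as stated in the question), and that each stored element costs one multiply-add per class. The cost model and the variable names are assumptions made for illustration only.

```python
stored_elements = 4205309          # nonzeros reported for the 249993-row matrix
n_samples = 249993
n_classes = 7                      # number of tweet categories in the question
predict_seconds = 0.131999969482   # measured time for sgd.predict
transform_seconds = 14.9649999142  # measured time for vectorizer.transform

# Assumed cost model: one multiply-add per stored element per class.
multiply_adds = stored_elements * n_classes
multiply_adds_per_second = multiply_adds / predict_seconds
avg_nonzeros_per_tweet = stored_elements / float(n_samples)
tweets_vectorized_per_second = n_samples / transform_seconds

print("multiply-adds: %d" % multiply_adds)
print("multiply-adds per second: %.3g" % multiply_adds_per_second)
print("average nonzeros per tweet: %.1f" % avg_nonzeros_per_tweet)
print("tweets vectorized per second: %.0f" % tweets_vectorized_per_second)
```

Under these assumptions the prediction step needs roughly 29 million multiply-adds, or a couple of hundred million per second, which compiled sparse routines can plausibly reach on an ordinary machine. Vectorization, by contrast, has to tokenize every string and build its n-grams, which is far more work per tweet.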
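They are both sparse matrices. I have added the output of print X_train to the question. So you think this is normal and possible? – Ojtwist
Actually, I made a mistake: to print the statistics you need the 'repr' builtin. I have edited my answer again. – ogrisel
Added the output. – Ojtwist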