I have 2599 documents, labeled from 1 to 5, for a five-class text classification task. How can I use scikit-learn and matplotlib to plot an SVC classifier on this imbalanced dataset?
label | texts
------|------
5     | 1190
4     |  839
3     |  239
1     |  204
2     |  127
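Quantifying the skew in that table (a quick sketch, using only the counts above):

```python
# Quick sketch: quantify the imbalance in the label table above.
counts = {5: 1190, 4: 839, 3: 239, 1: 204, 2: 127}
total = sum(counts.values())
print(total)                  # 2599 documents in all
print(counts[5] / counts[2])  # majority/minority ratio, roughly 9.4

# inverse-frequency weights, the same formula class_weight='balanced' uses
weights = {c: total / (len(counts) * n) for c, n in counts.items()}
```

The rarest class (2) ends up with the largest weight, which is the basis of the cost-sensitive fix discussed below.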
Everything I have tried so far gives very poor performance on this data, along with a warning about ill-defined metrics:
Accuracy: 0.461057692308
score: 0.461057692308
precision: 0.212574195636
recall: 0.461057692308

UndefinedMetricWarning: Precision and F-score are ill-defined and being set
to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

confusion matrix:
[[  0   0   0   0 153]
 [  0   0   0   0  94]
 [  0   0   0   0 194]
 [  0   0   0   0 680]
 [  0   0   0   0 959]]

classification report:
             precision    recall  f1-score   support

          1       0.00      0.00      0.00       153
          2       0.00      0.00      0.00        94
          3       0.00      0.00      0.00       194
          4       0.00      0.00      0.00       680
          5       0.46      1.00      0.63       959

avg / total       0.21      0.46      0.29      2080
Clearly this is happening because my dataset is imbalanced, so I found this paper, where the authors propose several approaches to deal with the problem:
The problem is that with imbalanced datasets, the learned boundary is too close to the positive instances. We need to bias SVM in a way that will push the boundary away from the positive instances. Veropoulos et al [14] suggest using different error costs for the positive (C +) and negative (C -) classes
I know this can get quite involved, but SVC offers several hyperparameters. So my question is: is there any way to bias SVC, through the hyperparameters it provides, so that the classifier pushes the decision boundary away from the positive instances, as the paper suggests? I know this may be a difficult question, but any help is welcome. Thanks in advance.
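There is: SVC's `class_weight` parameter multiplies the penalty `C` per class, i.e. it sets different error costs `C_i = C * w_i`, which is exactly the asymmetric-cost idea in the quoted paper. A minimal sketch on synthetic two-class data (not the CSV above), comparing a plain SVC against `class_weight='balanced'`:

```python
# Sketch: class_weight implements per-class error costs C_i = C * w_i.
# Synthetic imbalanced data, not the asker's dataset.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# 200 majority samples vs 20 minority samples, overlapping clusters
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)),
               rng.normal(1.5, 1.0, (20, 2))])
y = np.array([0] * 200 + [1] * 20)

plain = SVC(kernel='linear').fit(X, y)
# 'balanced' sets w_i = n_samples / (n_classes * n_i): rare classes cost more
weighted = SVC(kernel='linear', class_weight='balanced').fit(X, y)

# the weighted model predicts the minority class more often,
# because its boundary has been pushed toward the majority class
print((plain.predict(X) == 1).sum(), (weighted.predict(X) == 1).sum())
```

With five classes you can pass a dict such as `class_weight={1: w1, ..., 5: w5}` instead of `'balanced'` to tune each cost by hand.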
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

tfidf_vect = TfidfVectorizer(use_idf=True, smooth_idf=True,
                             sublinear_tf=False, ngram_range=(2, 2))

df = pd.read_csv('/path/of/the/file.csv',
                 header=0, sep=',', names=['id', 'text', 'label'])

reduced_data = tfidf_vect.fit_transform(df['text'].values)
y = df['label'].values

# project the sparse tf-idf matrix down to a few dense components
svd = TruncatedSVD(n_components=5)
reduced_data = svd.fit_transform(reduced_data)

X_train, X_test, y_train, y_test = train_test_split(reduced_data, y,
                                                    test_size=0.33)
# with no weights:
from sklearn.svm import SVC
clf = SVC(kernel='linear')   # no class_weight for the baseline
clf.fit(X_train, y_train)    # fit on the training split only
prediction = clf.predict(X_test)

# note: for a multiclass SVC, coef_[0] is only the first of the
# one-vs-one boundaries, so this is just one of several hyperplanes
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - clf.intercept_[0] / w[1]
# get the separating hyperplane using weighted classes
wclf = SVC(kernel='linear', class_weight={1: 10})
wclf.fit(X_train, y_train)
ww = wclf.coef_[0]
wa = -ww[0] / ww[1]
wyy = wa * xx - wclf.intercept_[0] / ww[1]
# plot separating hyperplanes and samples
import matplotlib.pyplot as plt
h0 = plt.plot(xx, yy, 'k-', label='no weights')
h1 = plt.plot(xx, wyy, 'k--', label='with weights')
plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=y, cmap=plt.cm.Paired)
plt.legend()
plt.axis('tight')
plt.show()
But I get nothing useful out of this and I can't tell what is going on. This is the plot:
Then:
# Let's show some metrics [unweighted]:
from sklearn.metrics import (precision_score, recall_score,
                             confusion_matrix, classification_report,
                             accuracy_score)

print('\nAccuracy:', accuracy_score(y_test, prediction))
print('\nscore:', clf.score(X_train, y_train))
print('\nrecall:', recall_score(y_test, prediction, average='weighted'))
print('\nprecision:', precision_score(y_test, prediction, average='weighted'))
print('\nclassification report:\n', classification_report(y_test, prediction))
print('\nconfusion matrix:\n', confusion_matrix(y_test, prediction))
# Let's show some metrics [weighted]:
print('weighted:\n')
wprediction = wclf.predict(X_test)   # use the weighted model's predictions

print('\nAccuracy:', accuracy_score(y_test, wprediction))
print('\nscore:', wclf.score(X_train, y_train))
print('\nrecall:', recall_score(y_test, wprediction, average='weighted'))
print('\nprecision:', precision_score(y_test, wprediction, average='weighted'))
print('\nclassification report:\n', classification_report(y_test, wprediction))
print('\nconfusion matrix:\n', confusion_matrix(y_test, wprediction))
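A note related to the ill-defined-metric warning above: with multiclass labels, `precision_score` and `recall_score` average across classes; passing `average=None` returns the per-class values and makes collapsed classes visible directly. A small sketch with toy labels (not the real data):

```python
# Sketch: average=None exposes per-class recall, revealing classes
# the model never predicts correctly. Toy labels for illustration.
from sklearn.metrics import recall_score

y_true = [1, 1, 2, 2, 3, 3, 3, 3]
y_pred = [3, 3, 3, 3, 3, 3, 3, 3]  # degenerate: everything predicted as 3

per_class_recall = recall_score(y_true, y_pred, labels=[1, 2, 3],
                                average=None)
print(per_class_recall)  # classes 1 and 2 get recall 0.0, class 3 gets 1.0
```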
This is the data I'm using. How can I fix this and plot the problem the right way? Thanks in advance!
Following the answer to this question, I removed these lines:
#
# from sklearn.decomposition.truncated_svd import TruncatedSVD
# svd = TruncatedSVD(n_components=5)
# reduced_data = svd.fit_transform(reduced_data)
#
# w = clf.coef_[0]
# a = -w[0]/w[1]
# xx = np.linspace(-10, 10)
# yy = a * xx - clf.intercept_[0]/w[1]
# ww = wclf.coef_[0]
# wa = -ww[0]/ww[1]
# wyy = wa * xx - wclf.intercept_[0]/ww[1]
#
# # plot separating hyperplanes and samples
# import matplotlib.pyplot as plt
# h0 = plt.plot(xx, yy, 'k-', label='no weights')
# h1 = plt.plot(xx, wyy, 'k--', label='with weights')
# plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=y, cmap=plt.cm.Paired)
# plt.legend()
#
# plt.axis('tight')
# plt.show()
These were the results:
Accuracy: 0.787878787879
score: 0.779437105112
recall: 0.787878787879
precision: 0.827705441238
The metrics improved. How can I plot these results to get nice figures like the ones in the documentation? I would like to see how the two hyperplanes behave. Thanks, guys!
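One way to get documentation-style plots (a sketch under assumptions: features reduced to 2 components, here stood in for by `make_classification` rather than the real tf-idf/SVD data) is to evaluate the fitted classifier over a mesh grid and draw the decision regions with `contourf`:

```python
# Sketch: plot multiclass SVC decision regions like the scikit-learn docs.
# Synthetic 2-D data stands in for the TruncatedSVD-reduced features.
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=3,
                           n_clusters_per_class=1, random_state=0)
clf = SVC(kernel='linear', class_weight='balanced').fit(X, y)

# evaluate the classifier over a grid covering the data range
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
    np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.Paired)  # colored regions
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired, edgecolors='k')
plt.savefig('regions.png')
```

To compare the weighted and unweighted models, fit both and draw their regions side by side in two subplots; this works for any number of classes, unlike the single-hyperplane `coef_[0]` approach.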
"Clearly this is happening due to the fact that I have an imbalanced dataset" - based on what you've said, I don't see that at all. Can you show us your code, or even your data? – IVlad 2015-02-12 11:05:14
What do you get without the SVD and without touching the class_weight parameter? Try focusing on performance first, and on plotting later. – IVlad 2015-02-15 09:17:10
@IVlad Using the documentation example without addressing the imbalanced dataset, this is the performance I get: 'Accuracy: 0.461057692308 score: 0.461057692308 precision: 0.212574195636 recall: 0.461057692308'. That's the best I could do with grid search. – tumbleweed 2015-02-15 18:10:20