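sklearn: how to speed up a vectorizer (e.g. TfidfVectorizer)
After thoroughly profiling my program, I have been able to pinpoint that it is being slowed down by the vectorizer.
I am working on text data, and two lines of simple tfidf unigram vectorization take up 99.2% of the total execution time of the code.
Here is a runnable example (this will download a 3 MB training file to your disk; omit the urllib parts to run it on your own sample):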
#####################################
# Loading Data
#####################################
from urllib.request import urlopen

import nltk.stem
from sklearn.feature_extraction.text import TfidfVectorizer

raw = urlopen("https://s3.amazonaws.com/hr-testcases/597/assets/trainingdata.txt").read()
# Save the downloaded bytes to disk; the with-block closes the file.
with open("to_delete.txt", "wb") as f:
    f.write(raw)

###
def extract_training():
    # The first line holds the sample count N; each following line is "<label> <text>".
    X = []
    y = []
    with open("to_delete.txt") as f:
        N = int(f.readline())
        for i in range(N):
            line = f.readline()
            label, text = int(line[0]), line[2:]
            X.append(text)
            y.append(label)
    return X, y

X_train, y_train = extract_training()

#############################################
# Extending Tfidf to have only stemmed features
#############################################
english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        # Reuse the parent analyzer (tokenizing, stop words), then stem each token.
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

tfidf = StemmedTfidfVectorizer(min_df=1, stop_words='english', analyzer='word', ngram_range=(1,1))

#############################################
# Line below takes 6-7 seconds on my machine
#############################################
Xv = tfidf.fit_transform(X_train)
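For reference, a per-function breakdown like the one behind the 99.2% figure can be obtained with the standard-library profiler. The following is only a minimal sketch: `profile_call` and `slow_tokenize` are hypothetical names introduced here for illustration, and `slow_tokenize` is a stand-in for the real vectorizer call, not part of the code above.

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and print at most the 10 most expensive entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


def slow_tokenize(docs):
    # Hypothetical stand-in for an expensive per-word step such as stemming.
    return [[w.lower() for w in doc.split()] for doc in docs]


tokens, report = profile_call(slow_tokenize, ["Some Example Text"] * 1000)
print(report)
```

With the example above, the equivalent call would be `Xv, report = profile_call(tfidf.fit_transform, X_train)`; the report should then show how much of the time is spent inside the stemmer compared with the rest of the analyzer.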
I tried converting the list X_train to an np.array, but there was no difference in performance.
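That the container type makes no difference suggests the time is spent analyzing each document rather than iterating over them. One idea worth testing, on the unconfirmed assumption that stemming the same words over and over dominates the cost, is to memoize the stemmer. The sketch below uses only the standard library; `make_cached_stemmer` and `toy_stem` are hypothetical names, and `toy_stem` is a stand-in for a real stemmer, not NLTK's implementation.

```python
from functools import lru_cache


def make_cached_stemmer(stem_fn, maxsize=None):
    """Wrap a word -> stem function so each distinct word is stemmed only once."""
    return lru_cache(maxsize=maxsize)(stem_fn)


# Hypothetical stand-in stemmer that counts how often it is really called.
calls = {"n": 0}


def toy_stem(word):
    calls["n"] += 1
    return word.lower().rstrip("s")


cached_stem = make_cached_stemmer(toy_stem)
stems = [cached_stem(w) for w in ["cats", "cat", "cats", "Dogs", "cats"]]
print(stems, calls["n"])  # prints ['cat', 'cat', 'cat', 'dog', 'cat'] 3
```

In the question's code this would amount to building `cached_stem = make_cached_stemmer(english_stemmer.stem)` once and returning `lambda doc: (cached_stem(w) for w in analyzer(doc))` from `build_analyzer`. Whether it helps depends on how often words repeat in the corpus, so it is worth measuring before and after.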
You could try this on http://codereview.stackexchange.com/. – matsjoyce 2014-10-04 18:27:37