Optimizing NLTK code that predicts from text

I want to build a model that predicts whether a job description's salary is above or below the 75th percentile (above = 1, below = 0). My data has roughly 250,000 rows, and labeling all the text from the job descriptions is difficult. My code seems to work correctly, but it takes an absurdly long time on anything more than 100 rows. I need to find a more efficient approach so I can include more rows in the prediction.
import random
import nltk
import pandas
import csv
import numpy as np
# columns 2 and 10 are FullDescription and SalaryNormalized in Train_rev1
io = pandas.read_csv('Train_rev1.csv', sep=',', usecols=(2, 10), nrows=501)
#converted = df.apply(lambda io : int(io[0]))
data = [np.array(x) for x in io.values]
random.shuffle(data)
size = int(len(data) * 0.6)
test_set, train_set = data[size:], data[:size]
train_set = np.array(train_set)
test_set = np.array(test_set)
x = train_set[:,1]
Sal75=np.percentile(x,75)
y = test_set[:,1]
Test75=np.percentile(y,75)
# binarize salaries: 1 if at or above the 75th percentile, else 0
for i in range(len(train_set)):
    if train_set[i, 1] >= Sal75:
        train_set[i, 1] = 1
    else:
        train_set[i, 1] = 0
for i in range(len(test_set)):
    if test_set[i, 1] >= Test75:
        test_set[i, 1] = 1
    else:
        test_set[i, 1] = 0
train_setT = [tuple(x) for x in train_set]
test_setT = [tuple(x) for x in test_set]
from nltk.tokenize import word_tokenize
all_words = set(word.lower() for passage in train_setT for word in word_tokenize(passage[0]))
# lowercase the text before tokenizing so it can actually match the lowercased vocabulary
t = [({word: (word in word_tokenize(x[0].lower())) for word in all_words}, x[1]) for x in train_setT]
classifier = nltk.NaiveBayesClassifier.train(t)
# test features must use the TRAINING vocabulary (all_words), not a new one
tt = [({word: (word in word_tokenize(x[0].lower())) for word in all_words}, x[1]) for x in test_setT]
print(nltk.classify.accuracy(classifier, tt))
classifier.show_most_informative_features(20)
testres = [label for (features, label) in tt]
predres = [classifier.classify(features) for (features, label) in tt]
from nltk.metrics import ConfusionMatrix
cm = ConfusionMatrix(testres, predres)
print(cm)
The csv file comes from Kaggle. Use Train_rev1.
Have you profiled your code to see where its bottleneck is? – Dalek 2014-09-12 21:38:13
Everything runs fine until it starts tokenizing each job description. – 2014-09-12 21:41:50
@Dalek The bottleneck starts when the words start getting tokenized. I don't know whether creating a dataframe or a dictionary instead of tuples would improve efficiency?! I can't figure out how to code it. – 2014-09-12 22:35:33
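The dominant cost in the posted code is that `word_tokenize(x[0])` is re-run for every word in the vocabulary (once per dict-comprehension lookup), making feature extraction roughly O(vocabulary × documents × tokenization). A minimal sketch of the fix: tokenize each document exactly once, store the tokens in a `set`, and do O(1) membership tests. (This example uses `str.split()` so it runs without any NLTK data; swap in `nltk.word_tokenize` for real use. The function name and sample data are hypothetical.)

```python
def build_features(documents, vocabulary):
    """documents: list of (text, label) pairs; vocabulary: iterable of words.

    Tokenizes each document ONCE and caches the tokens as a set, so each
    vocabulary lookup is O(1) instead of re-tokenizing the whole text.
    """
    featuresets = []
    for text, label in documents:
        tokens = set(text.lower().split())  # tokenize once per document
        features = {word: (word in tokens) for word in vocabulary}
        featuresets.append((features, label))
    return featuresets

# hypothetical toy data to illustrate the shape of the output
docs = [("Senior software engineer wanted", 1),
        ("Part time retail assistant", 0)]
vocab = {"senior", "engineer", "retail", "assistant"}
feats = build_features(docs, vocab)
print(feats[0][0]["engineer"])  # True
print(feats[1][0]["senior"])    # False
```

The same featuresets can be fed straight into `nltk.NaiveBayesClassifier.train`. Restricting the vocabulary to, say, the few thousand most frequent words (via `nltk.FreqDist`) would shrink each feature dict further and help much more than switching tuples for dataframes.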