
Optimizing NLTK code to make predictions from text

I want to build a model that predicts whether the salary for a job description is above or below the 75th percentile (1 if above, 0 if below). My data has about 250,000 rows, and tokenizing all the text from the job descriptions is hard. My code seems to work correctly, but it takes an insanely long time on anything more than 100 rows. I need a more efficient approach so that I can feed more rows into the prediction.

import random 
import nltk 
import pandas 
import csv 
import numpy as np 

# Columns 2 and 10 of Train_rev1 hold the job description text and the
# normalized salary; read only the first 501 rows for now.
io = pandas.read_csv('Train_rev1.csv', sep=',', usecols=(2, 10), nrows=501)
data = [np.array(x) for x in io.values]

# Shuffle, then split 60% train / 40% test.
random.shuffle(data)
size = int(len(data) * 0.6)
test_set, train_set = data[size:], data[:size]
train_set = np.array(train_set)
test_set = np.array(test_set)

# 75th-percentile salary threshold, computed separately for each split.
x = train_set[:, 1]
Sal75 = np.percentile(x, 75)
y = test_set[:, 1]
Test75 = np.percentile(y, 75)

# Binarize the salary column: 1 if at or above the threshold, else 0.
for i in range(len(train_set[:, 1])):
    if train_set[i, 1] >= Sal75:
        train_set[i, 1] = 1
    else:
        train_set[i, 1] = 0

for i in range(len(test_set[:, 1])):
    if test_set[i, 1] >= Test75:
        test_set[i, 1] = 1
    else:
        test_set[i, 1] = 0

# Convert each (description, label) row to a tuple for NLTK.
train_setT = [tuple(x) for x in train_set]
test_setT = [tuple(x) for x in test_set]



from nltk.tokenize import word_tokenize

# Vocabulary of lowercased tokens from the training descriptions.
all_words = set(word.lower() for passage in train_setT for word in word_tokenize(passage[0]))

# One boolean feature per vocabulary word. Lowercasing the text before the
# membership test keeps it consistent with the vocabulary; note that the
# description is re-tokenized once per vocabulary word, which is very slow.
t = [({word: (word in word_tokenize(x[0].lower())) for word in all_words}, x[1]) for x in train_setT]

classifier = nltk.NaiveBayesClassifier.train(t)

# Test features must use the training vocabulary, otherwise the classifier
# has never seen them.
tt = [({word: (word in word_tokenize(x[0].lower())) for word in all_words}, x[1]) for x in test_setT]


print(nltk.classify.accuracy(classifier, tt))
classifier.show_most_informative_features(20)

# Confusion matrix of true labels versus predictions.
testres = [label for features, label in tt]
predres = [classifier.classify(features) for features, label in tt]

from nltk.metrics import ConfusionMatrix
cm = ConfusionMatrix(testres, predres)
print(cm)

The CSV file comes from Kaggle; use Train_rev1.
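For reference, the two columns selected by usecols=(2, 10) can also be selected by name, which is less fragile if the column order changes. A minimal sketch, assuming the standard Train_rev1 schema (FullDescription at index 2, SalaryNormalized at index 10):

import pandas

# Same selection as usecols=(2, 10), but by column name (assumed schema).
io = pandas.read_csv('Train_rev1.csv',
                     usecols=['FullDescription', 'SalaryNormalized'],
                     nrows=501)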


Have you profiled your code to see where its bottleneck is? – Dalek 2014-09-12 21:38:13


Everything runs fine until it starts tokenizing each job description. – 2014-09-12 21:41:50


@Dalek the bottleneck starts when the words start getting tokenized. Would creating a data frame or a dictionary instead of tuples improve efficiency? I can't figure out how to code it. – 2014-09-12 22:35:33
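The slowness comes less from the tuples than from calling word_tokenize on every description once per vocabulary word. A minimal sketch, reusing the train_setT/test_setT tuples above, that tokenizes each description exactly once and caches the token sets:

from nltk.tokenize import word_tokenize

# Tokenize every description once and cache the lowercased token set.
train_tokens = [set(w.lower() for w in word_tokenize(text)) for text, label in train_setT]
test_tokens = [set(w.lower() for w in word_tokenize(text)) for text, label in test_setT]

# Same vocabulary as before, built from the cached token sets.
all_words = set().union(*train_tokens)

# Set membership is O(1), so building each feature dict no longer
# re-tokenizes the description for every vocabulary word.
t = [({word: (word in tokens) for word in all_words}, label)
     for tokens, (text, label) in zip(train_tokens, train_setT)]
tt = [({word: (word in tokens) for word in all_words}, label)
      for tokens, (text, label) in zip(test_tokens, test_setT)]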

Answer


After splitting the data into 60% and 40%, you can do the following. It needs new tools, perhaps not NLTK.

import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
from sklearn.metrics import classification_report

# train_set and test_set are the shuffled, binarized arrays from the question.
train_setT = [tuple(x) for x in train_set]
test_setT = [tuple(x) for x in test_set]


# Raw description strings and binary labels for each split.
train_set = np.array([el[0] for el in train_setT])
test_set = np.array([el[0] for el in test_setT])

y_train = np.array([el[1] for el in train_setT])
y_test = np.array([el[1] for el in test_setT])

# TF-IDF features over unigrams and bigrams; min_df=2 drops terms that
# appear in only one document.
vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2),
                             strip_accents='unicode', norm='l2')

# Fit the vocabulary on the training set only, then reuse it on the test set.
X_train = vectorizer.fit_transform(train_set)
X_test = vectorizer.transform(test_set)

nb_classifier = MultinomialNB().fit(X_train, y_train)

y_nb_predicted = nb_classifier.predict(X_test)


print(metrics.confusion_matrix(y_test, y_nb_predicted))
print(classification_report(y_test, y_nb_predicted))
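If the NLTK-style list of informative words is still wanted, it can be reconstructed from the fitted vectorizer and classifier. A minimal sketch, assuming binary labels; the helper name is mine, and get_feature_names_out() is get_feature_names() in older scikit-learn versions:

import numpy as np

# Hypothetical helper mirroring NLTK's show_most_informative_features:
# rank terms by the gap in log P(term | class) between the two classes.
def most_informative_features(vectorizer, clf, n=20):
    names = np.asarray(vectorizer.get_feature_names_out())
    diff = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]
    top = np.argsort(diff)[-n:][::-1]
    for name, score in zip(names[top], diff[top]):
        print('%-25s %.3f' % (name, score))

most_informative_features(vectorizer, nb_classifier)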