2017-02-10 72 views

Can Python scikit functions in a python-rq queue run faster?

I currently have a utilities.py file that contains the machine-learning function:

from sklearn.pipeline import Pipeline 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.naive_bayes import MultinomialNB 
from sklearn.model_selection import train_test_split 
import models 
import random 

words = [w.strip() for w in open('words.txt') if w == w.lower()] 
def scramble(s): 
    return "".join(random.sample(s, len(s))) 

@models.db_session 
def check_pronounceability(word): 

    scrambled = [scramble(w) for w in words] 

    X = words+scrambled 
    y = ['word']*len(words) + ['unpronounceable']*len(scrambled) 
    X_train, X_test, y_train, y_test = train_test_split(X, y) 

    text_clf = Pipeline([
        ('vect', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
        ('clf', MultinomialNB())
    ])
    text_clf = text_clf.fit(X_train, y_train) 
    stuff = text_clf.predict_proba([word]) 
    pronounceability = round(100*stuff[0][1], 2) 
    models.Word(word=word, pronounceability=pronounceability) 
    models.commit() 
    return pronounceability 

Then in my app.py I have:

from flask import Flask, render_template, jsonify, request 
from rq import Queue 
from rq.job import Job 
from worker import conn 
from flask_cors import CORS 
from utilities import check_pronounceability 

app = Flask(__name__) 

q = Queue(connection=conn) 

import models 
@app.route('/api/word', methods=['POST', 'GET']) 
@models.db_session 
def check(): 
    if request.method == "POST":
        word = request.form['word']
        if not word:
            return render_template('index.html')
        db_word = models.Word.get(word=word)
        if not db_word:
            job = q.enqueue_call(check_pronounceability, args=(word,))
        return jsonify(job=job.id)

which calls it. Reading the python-rq performance notes, it states:

A pattern you can use to improve the throughput performance for these kind of jobs can be to import the necessary modules before the fork.

I then made my worker.py file look like this:

import os 

import redis 
from rq import Worker, Queue, Connection 

listen = ['default'] 

redis_url = os.getenv('REDISTOGO_URL', 'redis://localhost:6379') 

conn = redis.from_url(redis_url) 
import utilities 

if __name__ == '__main__': 
    with Connection(conn):
        worker = Worker(list(map(Queue, listen)))
        worker.work()

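The problem I'm having is that this still runs slowly. Am I doing something wrong? Is there any way I can make it run faster by keeping everything in memory when I check a word? According to the xpost I did in the python-rq, it seems I am importing it correctly.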

Answers


I have a few suggestions:

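  1. Before you start optimizing the throughput of python-rq, check where the bottleneck is. I would be surprised if the queue were the bottleneck rather than the check_pronounceability function.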

  2. Make sure check_pronounceability runs as fast as it can on every call. Forget about the queue at this stage; it is not relevant.
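One way to check where the time goes is to time each phase separately. The following is a minimal sketch only: `timed`, `train_model` and `predict` are hypothetical stand-ins (not part of the original code) for the training and prediction steps inside check_pronounceability.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical stand-ins for the two phases of check_pronounceability.
def train_model():
    time.sleep(0.05)  # pretend that fitting the pipeline is the slow part
    return {"trained": True}


def predict(model, word):
    return 50.0  # pretend that scoring a single word is cheap


model, train_seconds = timed(train_model)
score, predict_seconds = timed(predict, model, "hello")
print(f"train: {train_seconds:.4f}s, predict: {predict_seconds:.4f}s")
```

If the training phase dominates, as it likely does here, tuning the queue will not help much.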

To optimize check_pronounceability I suggest you:

  1. Create the training data once for all API calls

  2. Forget about train_test_split. You are not using the test split, so why waste CPU cycles creating it?

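  3. Train NaiveBayes once for all API calls. The input to check_pronounceability is a single word that needs to be classified as pronounceable or not, so there is no need to build a new model for every new word. Just create one model and reuse it for all words. This also has the benefit of giving stable results, and it makes the model easier to change.

Suggested modifications are below: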

from sklearn.pipeline import Pipeline 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.naive_bayes import MultinomialNB 
from sklearn.preprocessing import LabelBinarizer
import models 
import random 

words = [w.strip() for w in open('words.txt') if w == w.lower()] 
def scramble(s): 
    return "".join(random.sample(s, len(s))) 

scrambled = [scramble(w) for w in words] 
X = words+scrambled 
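# explicitly create binary labels: 'unpronounceable' -> 0, 'word' -> 1
# (LabelBinarizer orders the classes alphabetically); ravel() flattens the
# (n, 1) column it returns into the 1-D array that fit() expects
label_binarizer = LabelBinarizer()
y = label_binarizer.fit_transform(['word']*len(words) + ['unpronounceable']*len(scrambled)).ravel()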

text_clf = Pipeline([ 
    ('vect', CountVectorizer(analyzer='char', ngram_range=(1, 3))), 
    ('clf', MultinomialNB()) 
]) 
text_clf = text_clf.fit(X, y) 
# you might want to persist the Pipeline to disk at this point so it is not lost if there is a crash

@models.db_session 
def check_pronounceability(word): 
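    # the model is already trained at import time; only predict here
    stuff = text_clf.predict_proba([word])
    # column 1 holds the probability of the 'word' class
    pronounceability = round(100*stuff[0][1], 2)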
    models.Word(word=word, pronounceability=pronounceability) 
    models.commit() 
    return pronounceability 
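The comment about persisting the Pipeline could be handled with a small load-or-train helper. This is a sketch under assumptions: `load_or_train`, `train_stub` and the `text_clf.pkl` file name are hypothetical, and a plain dict stands in for the fitted Pipeline so the example runs without scikit-learn. In real use, the training function would return the fitted `text_clf`.

```python
import os
import pickle
import tempfile


def load_or_train(path, train_fn):
    """Return the model pickled at path, training and saving it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    model = train_fn()
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return model


calls = []


def train_stub():
    # Hypothetical stand-in for fitting the Pipeline.
    calls.append(1)
    return {"classes": ["unpronounceable", "word"]}


cache_path = os.path.join(tempfile.mkdtemp(), "text_clf.pkl")
first = load_or_train(cache_path, train_stub)   # trains and writes the file
second = load_or_train(cache_path, train_stub)  # loads from disk, no retraining
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.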

Final notes:

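  • I am assuming you have done some cross-validation of the model elsewhere to actually find out whether it does a good job of predicting label probabilities. If you haven't, you should. In general, NaiveBayes is not the best at producing reliable class probability predictions; it tends to be either overconfident or too timid (probabilities close to 1 or 0). You should check this in your database. Using a LogisticRegression classifier should give you more reliable probability predictions. Now that model training is no longer part of the API call, it does not matter much how long the model takes to train.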
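The cross-validation mentioned above could look roughly like the following sketch, which compares the two classifiers by log loss (a score that penalizes overconfident wrong probabilities). The short word list is a hypothetical stand-in for words.txt, `build_pipeline` is a helper introduced only for this illustration, and the resulting numbers on such a tiny sample say nothing about the real data.

```python
import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

random.seed(0)

# Hypothetical stand-in for the contents of words.txt.
words = ["apple", "banana", "cherry", "orange", "lemon", "melon", "grape",
         "peach", "mango", "papaya", "tomato", "potato", "carrot", "onion",
         "garlic", "pepper", "butter", "sugar", "coffee", "water"]


def scramble(s):
    return "".join(random.sample(s, len(s)))


X = words + [scramble(w) for w in words]
y = [1] * len(words) + [0] * len(words)  # 1 = word, 0 = unpronounceable


def build_pipeline(clf):
    """Same character n-gram features as the answer, with a swappable classifier."""
    return Pipeline([
        ("vect", CountVectorizer(analyzer="char", ngram_range=(1, 3))),
        ("clf", clf),
    ])


scores = {}
for name, clf in [("nb", MultinomialNB()),
                  ("logreg", LogisticRegression(max_iter=1000))]:
    # neg_log_loss: closer to 0 means better probability estimates
    scores[name] = cross_val_score(
        build_pipeline(clf), X, y, cv=5, scoring="neg_log_loss"
    ).mean()

print(scores)
```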


Thanks, this fixed it. I will look into using the LogisticRegression classifier instead – nadermx