2012-08-23 36 views

I have a database with 10,000 adam_id's. For each adam_id, I need to pull down information through an API. How do I distribute that workload in Python?

My table looks like this:

`title` 
- adam_id 
- success (boolean) 
- number_of_tries (# of failed pull-down attempts so far, i.e. times success=0)

Here is the function I would like to create:

def pull_down(cursor): 
    work_remains = True 
    while work_remains: 
        cursor.execute("""SELECT adam_id FROM title WHERE success=0 
                          AND number_of_tries < 5 ORDER BY adam_id LIMIT 1""") 
        row = cursor.fetchone()  # fetchall() would consume the result set first 
        if row: 
            do_api_call(row[0]) 
        else: 
            work_remains = False 

def do_api_call(adam_id): 
    # do api call 
    if success: 
        cursor.execute("UPDATE title SET success=1 WHERE adam_id = %s", (adam_id,)) 
    else: 
        cursor.execute("""UPDATE title SET number_of_tries = number_of_tries + 1 
                          WHERE adam_id = %s""", (adam_id,)) 

How would I do the above with Python's multiprocessing functionality and n workers, instead of running it as a single synchronous process? I have started looking at the multiprocessing module (http://docs.python.org/library/multiprocessing.html), but so far I am finding it hard to digest.


Is the slow bit something multiprocessing will actually help with? Have you profiled this? For one thing, fetching more than one row at a time should speed things up, I would think. Or is it the API call that is slow? What is the API? – jozzas


@jozzas: The API call takes about 30 seconds per call. – David542

Answer


If the significant part of the work is the API call, since it touches an external resource, then that is the only part you really want to parallelize. The database calls are probably very fast. So you could try this:

  1. Get all the adam_id values in one batch query
  2. Hand the IDs to a process pool that does the API calls
  3. Collect the results and submit them back to the database

Here is a rough pseudocode example to show the logical flow:

from multiprocessing import Pool 

def pull_down(cursor): 
    # get all the ids in one query (no LIMIT 1 this time) 
    cursor.execute("""SELECT adam_id FROM title WHERE success=0 
                      AND number_of_tries < 5 ORDER BY adam_id""") 
    rows = cursor.fetchall() 
    if rows: 
        # Step #1 
        adam_id_list = [row[0] for row in rows] 

        # Step #2 
        pool = Pool(4) 
        results = pool.map(do_api_call, adam_id_list) 
        pool.close() 
        pool.join() 

        # Step #3 
        update_db(results) 

def do_api_call(adam_id): 
    # do api call 
    success = call_api_with_id(adam_id) 
    return (adam_id, success) 

def update_db(results): 
    # loop over results and build batch queries for the succeeded 
    # and failed items 

    # (obviously this split up could be optimized) 
    succeeded = [result[0] for result in results if result[1]] 
    failed = [result[0] for result in results if not result[1]] 

    submit_success(succeeded) 
    submit_failed(failed) 

I only put the API call in the pool. If you tried to make the database calls parallel as well, you would have to correctly give each process its own connection, and the database is not what is actually slow here anyway.
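To make the batch-select / pool-map / batch-update flow above concrete, here is a minimal runnable sketch. It uses sqlite3 as a stand-in database and a `fake_api_call` helper in place of the real 30-second API request (both are assumptions for illustration, not part of the real setup):

```python
import sqlite3
from multiprocessing import Pool

def fake_api_call(adam_id):
    # Stand-in for the real API call; here even ids "succeed".
    return (adam_id, adam_id % 2 == 0)

def process_batch(db_path, workers=4):
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()

    # Step #1: one batch query for all pending ids
    cur.execute("""SELECT adam_id FROM title
                   WHERE success=0 AND number_of_tries < 5
                   ORDER BY adam_id""")
    ids = [row[0] for row in cur.fetchall()]
    if not ids:
        conn.close()
        return

    # Step #2: only the API calls run in the worker pool
    with Pool(workers) as pool:
        results = pool.map(fake_api_call, ids)

    # Step #3: two batched updates from the parent process,
    # so only one connection ever touches the database
    succeeded = [(i,) for i, ok in results if ok]
    failed = [(i,) for i, ok in results if not ok]
    cur.executemany("UPDATE title SET success=1 WHERE adam_id=?", succeeded)
    cur.executemany("""UPDATE title SET number_of_tries = number_of_tries + 1
                       WHERE adam_id=?""", failed)
    conn.commit()
    conn.close()
```

You would call `process_batch` in a loop (or a cron job) until no rows remain with `number_of_tries < 5`; each pass retries the failures from the previous pass.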