2012-08-23 36 views

I have a database with 10,000 adam_id's. For each adam_id, I need to pull down information through an API. How do I distribute that workload in Python?

My table looks like this:

`title` 
- adam_id 
- success (boolean) 
- number_of_tries (# of failed pull-down attempts so far, i.e. times success=0)

Here is the function I would like to create:

def pull_down(cursor): 
    work_remains = True 
    while work_remains: 
        cursor.execute("""SELECT adam_id FROM title WHERE success=0 
                          AND number_of_tries < 5 ORDER BY adam_id LIMIT 1""") 
        row = cursor.fetchone()  # fetchall() would consume the result set first 
        if row: 
            do_api_call(row[0]) 
        else: 
            work_remains = False 

def do_api_call(adam_id): 
    # do api call 
    if success: 
        cursor.execute("UPDATE title SET success=1 WHERE adam_id = %s", (adam_id,)) 
    else: 
        cursor.execute("""UPDATE title SET number_of_tries = number_of_tries + 1 
                          WHERE adam_id = %s""", (adam_id,)) 

How would I do the above with Python's multiprocessing functionality and n workers, instead of running it as a single synchronous process? I have started looking at the multiprocessing module (http://docs.python.org/library/multiprocessing.html), but so far I am finding it hard to digest.


Is the slow bit something multiprocessing will actually help with? Have you profiled this? For one thing, fetching more than one row at a time should speed things up, I would think. Or is it the API call that is slow? What is the API? – jozzas


@jozzas: The API call takes about 30 seconds per call. – David542

Answer


If the significant part of the work is the API call, since it touches an external resource, then that is the only part you really want to parallelize. The database calls are probably very fast. So you could try this:

  1. Get all the adam_id values in one batch query
  2. Hand the IDs to a process pool that does the API calls
  3. Collect the results and submit them back to the database

Here is a rough pseudocode example to show the logical flow:

from multiprocessing import Pool 

def pull_down(cursor): 
    # get all the ids in one query (no LIMIT 1 this time) 
    cursor.execute("""SELECT adam_id FROM title WHERE success=0 
                      AND number_of_tries < 5 ORDER BY adam_id""") 
    rows = cursor.fetchall() 
    if rows: 
        # Step #1 
        adam_id_list = [row[0] for row in rows] 

        # Step #2 
        pool = Pool(4) 
        results = pool.map(do_api_call, adam_id_list) 
        pool.close() 
        pool.join() 

        # Step #3 
        update_db(results) 

def do_api_call(adam_id): 
    # do api call 
    success = call_api_with_id(adam_id) 
    return (adam_id, success) 

def update_db(results): 
    # loop over results and build batch queries for the succeeded 
    # and failed items 

    # (obviously this split up could be optimized) 
    succeeded = [result[0] for result in results if result[1]] 
    failed = [result[0] for result in results if not result[1]] 

    submit_success(succeeded) 
    submit_failed(failed) 

I only put the API call in the pool. If you tried to make the database calls parallel as well, you would have to correctly give each process its own connection, and the database is not what is actually slow here anyway.
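To make the batch-select / pool-map / batch-update flow above concrete, here is a minimal runnable sketch. It uses sqlite3 as a stand-in database and a `fake_api_call` helper in place of the real 30-second API request (both are assumptions for illustration, not part of the real setup):

```python
import sqlite3
from multiprocessing import Pool

def fake_api_call(adam_id):
    # Stand-in for the real API call; here even ids "succeed".
    return (adam_id, adam_id % 2 == 0)

def process_batch(db_path, workers=4):
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()

    # Step #1: one batch query for all pending ids
    cur.execute("""SELECT adam_id FROM title
                   WHERE success=0 AND number_of_tries < 5
                   ORDER BY adam_id""")
    ids = [row[0] for row in cur.fetchall()]
    if not ids:
        conn.close()
        return

    # Step #2: only the API calls run in the worker pool
    with Pool(workers) as pool:
        results = pool.map(fake_api_call, ids)

    # Step #3: two batched updates from the parent process,
    # so only one connection ever touches the database
    succeeded = [(i,) for i, ok in results if ok]
    failed = [(i,) for i, ok in results if not ok]
    cur.executemany("UPDATE title SET success=1 WHERE adam_id=?", succeeded)
    cur.executemany("""UPDATE title SET number_of_tries = number_of_tries + 1
                       WHERE adam_id=?""", failed)
    conn.commit()
    conn.close()
```

You would call `process_batch` in a loop (or a cron job) until no rows remain with `number_of_tries < 5`; each pass retries the failures from the previous pass.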