Using pool.map to apply a function to a list of strings in parallel?

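I have a large list of HTTP user agent strings (taken from a pandas DataFrame) that I'm trying to parse with the Python implementation of ua-parser. I can parse the list fine using a single thread, but based on some preliminary speed tests, running the whole dataset would take more than 10 hours.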

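I'd like to use pool.map() to cut the processing time, but I can't seem to figure out how to get it to work. I've read about a dozen "tutorials" I found online and searched here (this may be a duplicate of sorts, since there are many similar questions), but dozens of attempts have failed for one reason or another. I'm assuming/hoping this is a simple fix.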

Here's what I have so far:

import multiprocessing as mp

from ua_parser import user_agent_parser

http_str = df['user_agents'].tolist()

def uaparse(http_str):
    for i, item in enumerate(http_str):
        return user_agent_parser.Parse(http_str[i])

pool = mp.Pool(processes=10)
parsed = pool.map(uaparse, range(0,len(http_str)))

Right now I'm seeing the following error message:

--------------------------------------------------------------------------- 
TypeError         Traceback (most recent call last) 
<ipython-input-25-701fbf58d263> in <module>() 
     7 
     8 pool = mp.Pool(processes=10) 
----> 9 results = pool.map(uaparse, range(0,len(http_str))) 

/home/ubuntu/anaconda/lib/python2.7/multiprocessing/pool.pyc in map(self, func, iterable, chunksize) 
    249   ''' 
    250   assert self._state == RUN 
--> 251   return self.map_async(func, iterable, chunksize).get() 
    252 
    253  def imap(self, func, iterable, chunksize=1): 

/home/ubuntu/anaconda/lib/python2.7/multiprocessing/pool.pyc in get(self, timeout) 
    565    return self._value 
    566   else: 
--> 567    raise self._value 
    568 
    569  def _set(self, i, obj): 

TypeError: 'int' object is not iterable

Thanks in advance for any help/direction you can provide.

Source

2015-09-25 scribbles

It looks like all you need is:

http_str = df['user_agents'].tolist() 

pool = mp.Pool(processes=10) 
parsed = pool.map(user_agent_parser.Parse, http_str)
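
For anyone wondering why the original attempt raised the TypeError: pool.map calls the function once per element of the iterable, so with range(0, len(http_str)) each call to uaparse received a single int, and enumerate() on an int fails. Here is a rough sketch of both cases. It assumes Python 3 (for the `with mp.Pool(...)` form), and `fake_parse`, `uaparse_buggy` and `parse_all` are made-up names for illustration; `fake_parse` is only a stand-in for `user_agent_parser.Parse` so the snippet runs without ua-parser installed.

```python
import multiprocessing as mp


def fake_parse(ua_string):
    """Hypothetical stand-in for user_agent_parser.Parse (an assumption)."""
    return {"string": ua_string, "family": ua_string.split("/")[0]}


def uaparse_buggy(http_str):
    # Mirrors the question's function: it expects a whole list, but
    # pool.map hands it one element of the iterable (an int) per call.
    for i, item in enumerate(http_str):
        return fake_parse(http_str[i])


def parse_all(strings, processes=2):
    # pool.map does the looping itself: one fake_parse call per string,
    # with results returned in the same order as the input.
    with mp.Pool(processes=processes) as pool:
        return pool.map(fake_parse, strings)


if __name__ == "__main__":
    sample = ["Mozilla/5.0 (X11; Linux x86_64)", "curl/7.43.0"]
    try:
        uaparse_buggy(0)  # what each worker effectively ran in the question
    except TypeError as exc:
        print("TypeError:", exc)
    print(parse_all(sample))
```

So the function passed to pool.map should handle one string, not the whole list, which is why passing user_agent_parser.Parse directly works.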

Source

2015-09-25 19:16:56

Thanks! I never thought it would be this simple. – scribbles
