2017-02-16 39 views
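How can I improve the speed of this Python requests session?

I am using Anaconda with Python 3.5.2.

I have a list of 280,000 URLs. I am scraping the data and trying to keep track of which URL each piece of data came from.

I have made about 30K requests so far. I am averaging 1 request per second.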

import pandas as pd
import requests
from pandas.io.json import json_normalize  # pd.json_normalize in pandas >= 1.0

# url_list and results_csv are defined elsewhere
response_df = pd.DataFrame()
# create the session
with requests.Session() as s:
    # loop through the list of urls
    for url in url_list:
        # call the resource
        resp = s.get(url)
        # check the response
        if resp.status_code == requests.codes.ok:
            # create a new dataframe with the response
            ftest = json_normalize(resp.json())
            ftest['url'] = url
            # DataFrame.append was removed in pandas 2.0; this runs on older pandas
            response_df = response_df.append(ftest, ignore_index=True)
        else:
            print("Something went wrong! Hide your wife! Hide the kids!")

response_df.to_csv(results_csv)
Please also indent the code properly.

Also, consider parallelizing the code.

Also, consider preallocating the output DF.
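
As a rough sketch of the preallocation suggestion: growing a DataFrame with append copies it on every iteration, so one option is to collect the per-URL frames in a list and concatenate once at the end. The `build_frame` helper and the sample payloads are made up for illustration, and `pd.json_normalize` assumes pandas 1.0 or later.

```python
import pandas as pd


def build_frame(responses):
    """Build one DataFrame from (url, parsed_json) pairs.

    The input shape is a hypothetical stand-in for the scraped responses.
    """
    frames = []
    for url, payload in responses:
        frame = pd.json_normalize(payload)  # flatten nested JSON into columns
        frame["url"] = url                  # keep the url-to-data mapping
        frames.append(frame)                # cheap list append, no copying
    if not frames:
        return pd.DataFrame()
    # A single concatenation instead of one copy per request
    return pd.concat(frames, ignore_index=True)


# Made-up payloads standing in for resp.json()
sample = [
    ("http://example.com/a", {"id": 1, "meta": {"name": "a"}}),
    ("http://example.com/b", {"id": 2, "meta": {"name": "b"}}),
]
df = build_frame(sample)
```

This will not fix a rate of 1 request per second on its own, since the time is mostly spent waiting on the network, but it keeps the bookkeeping cost flat as the list of results grows.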

Answers

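I ended up dropping requests and switched to async and aiohttp. With requests I was getting about 1 request per second. The new approach averages about 5 per second while using only about 20% of my system resources. I ended up using something very similar to this: https://www.blog.pythonlibrary.org/2016/07/26/python-3-an-intro-to-asyncio/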

import aiohttp
import asyncio
import async_timeout
import os

async def download_coroutine(session, url):
    # current async_timeout releases require "async with" here
    async with async_timeout.timeout(10):
        async with session.get(url) as response:
            filename = os.path.basename(url)
            with open(filename, 'wb') as f_handle:
                # stream the body to disk in 1 KB chunks
                while True:
                    chunk = await response.content.read(1024)
                    if not chunk:
                        break
                    f_handle.write(chunk)
            # leaving the "async with" block releases the response

async def main():
    urls = ["http://www.irs.gov/pub/irs-pdf/f1040.pdf",
            "http://www.irs.gov/pub/irs-pdf/f1040a.pdf",
            "http://www.irs.gov/pub/irs-pdf/f1040ez.pdf",
            "http://www.irs.gov/pub/irs-pdf/f1040es.pdf",
            "http://www.irs.gov/pub/irs-pdf/f1040sb.pdf"]

    async with aiohttp.ClientSession() as session:
        for url in urls:
            await download_coroutine(session, url)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
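
One caveat about the loop in `main` as written: it awaits each download before starting the next, so the requests still run one at a time. A minimal sketch of running them concurrently with a cap on in-flight requests, using `asyncio.Semaphore` and `asyncio.gather`, might look like the following. `fetch_all` and `fake_fetch` are hypothetical names, `fake_fetch` only sleeps instead of doing real HTTP, and `asyncio.run` assumes Python 3.7+ (on 3.5, `loop.run_until_complete` does the same job).

```python
import asyncio


async def fetch_all(urls, fetch, limit=20):
    """Run fetch(url) for every url, with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        # Wait for a free slot before starting the request
        async with semaphore:
            return url, await fetch(url)

    # gather schedules all coroutines and returns results in input order
    return await asyncio.gather(*(bounded(u) for u in urls))


async def fake_fetch(url):
    # Stand-in for a real HTTP call (assumption: no network access here)
    await asyncio.sleep(0.01)
    return {"length": len(url)}


urls = ["http://example.com/%d" % i for i in range(50)]
results = asyncio.run(fetch_all(urls, fake_fetch, limit=10))
```

With aiohttp, `fetch` would be a coroutine that wraps `session.get(url)` and returns the parsed body. Each result keeps its URL next to the data, which covers the url-to-data tracking from the question.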

Also, these were helpful: https://snarky.ca/how-the-heck-does-async-await-work-in-python-3-5/ http://www.pythonsandbarracudas.com/blog/2015/11/22/developing-a-computational-pipeline-using-the-asyncio-module-in-python-3