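How can I make my Python code run faster

I am working on code that loops through multiple NetCDF files (large, ~28 GB). The NetCDF files have multiple 4D variables [time, east-west, south-north, height] throughout a domain. The goal is to loop through these files, visit every location of all these variables in the domain, and store certain variables in one large array. When a file is missing or incomplete, I fill the values with 99.99. Right now I am only testing by looping over 2 daily NetCDF files, but for some reason it is taking forever (~14 hours). I am not sure whether there is a way to optimize this code. I don't think Python should take this long for this task, but maybe the problem is with Python or with my code. My code is below; hopefully it is readable. Any suggestions on how to make it faster are greatly appreciated: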

#Domain to loop over 
k_space = np.arange(0,37) 
j_space = np.arange(80,170) 
i_space = np.arange(200,307) 

predictors_wrf=[] 
names_wrf=[] 

counter = 0 
cdate = start_date 
while cdate <= end_date: 
    if cdate.month not in month_keep: 
     cdate+=inc 
     continue 
    yy = cdate.strftime('%Y')   
    mm = cdate.strftime('%m') 
    dd = cdate.strftime('%d') 
    filename = wrf_path+'\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00' 
    for i in i_space: 
     for j in j_space: 
      for k in k_space: 
        if os.path.isfile(filename): 
         f = nc.Dataset(filename,'r') 
         times = f.variables['Times'][1:] 
         num_lines = times.shape[0] 
         if num_lines == 144: 
          u = f.variables['U'][1:,k,j,i] 
          v = f.variables['V'][1:,k,j,i] 
          wspd = np.sqrt(u**2.+v**2.) 
          w = f.variables['W'][1:,k,j,i] 
          p = f.variables['P'][1:,k,j,i] 
          t = f.variables['T'][1:,k,j,i] 
         if num_lines < 144: 
          print "partial files for WRF: "+ filename 
          u = np.ones((144,))*99.99 
          v = np.ones((144,))*99.99 
          wspd = np.ones((144,))*99.99 
          w = np.ones((144,))*99.99 
          p = np.ones((144,))*99.99 
          t = np.ones((144,))*99.99 
        else: 
         u = np.ones((144,))*99.99 
         v = np.ones((144,))*99.99 
         wspd = np.ones((144,))*99.99 
         w = np.ones((144,))*99.99 
         p = np.ones((144,))*99.99 
         t = np.ones((144,))*99.99 
         counter=counter+1 
        predictors_wrf.append(u) 
        predictors_wrf.append(v) 
        predictors_wrf.append(wspd) 
        predictors_wrf.append(w) 
        predictors_wrf.append(p) 
        predictors_wrf.append(t) 
        u_names = 'u_'+str(k)+'_'+str(j)+'_'+str(i) 
        v_names = 'v_'+str(k)+'_'+str(j)+'_'+str(i) 
        wspd_names = 'wspd_'+str(k)+'_'+str(j)+'_'+str(i) 
        w_names = 'w_'+str(k)+'_'+str(j)+'_'+str(i) 
        p_names = 'p_'+str(k)+'_'+str(j)+'_'+str(i) 
        t_names = 't_'+str(k)+'_'+str(j)+'_'+str(i) 
        names_wrf.append(u_names) 
        names_wrf.append(v_names) 
        names_wrf.append(wspd_names) 
        names_wrf.append(w_names) 
        names_wrf.append(p_names) 
        names_wrf.append(t_names) 
    cdate+=inc

2017-02-22 HM14

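You can use multiprocessing to work on the files at the same time. Assign the k, j, i spaces to different processes and let each of them do its own task. – haifzhan

What is `nc.Dataset`? Also, before you try to improve the speed, you need to know why it is slow. You need to profile your code and *measure*. –

That is how I read NetCDF files in Python. I have a statement earlier in the code that is not shown here: `import netCDF4 as nc` – HM14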

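This is a rough first pass at tightening up your for loops. Since the file shape is only needed once per file, you can move the file handling outside the loops, which reduces the amount of data loading done in the inner processing. I still don't understand what counter and inc do, since they don't appear to be updated in the loop. As starting points, you will definitely want to look at the performance of the repeated string concatenation and of appending to predictors_wrf and names_wrf.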

k_space = np.arange(0,37) 
j_space = np.arange(80,170) 
i_space = np.arange(200,307) 

predictors_wrf=[] 
names_wrf=[] 

counter = 0 
cdate = start_date 
while cdate <= end_date: 
    if cdate.month not in month_keep: 
     cdate+=inc 
     continue 
    yy = cdate.strftime('%Y')   
    mm = cdate.strftime('%m') 
    dd = cdate.strftime('%d') 
    filename = wrf_path+'\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00' 
    file_exists = os.path.isfile(filename) 
    if file_exists: 
     f = nc.Dataset(filename,'r') 
     times = f.variables['Times'][1:] 
     num_lines = times.shape[0] 
    for i in i_space: 
     for j in j_space: 
      for k in k_space: 
        if file_exists:  
         if num_lines == 144: 
          u = f.variables['U'][1:,k,j,i] 
          v = f.variables['V'][1:,k,j,i] 
          wspd = np.sqrt(u**2.+v**2.) 
          w = f.variables['W'][1:,k,j,i] 
          p = f.variables['P'][1:,k,j,i] 
          t = f.variables['T'][1:,k,j,i] 
         if num_lines < 144: 
          print "partial files for WRF: "+ filename 
          u = np.ones((144,))*99.99 
          v = np.ones((144,))*99.99 
          wspd = np.ones((144,))*99.99 
          w = np.ones((144,))*99.99 
          p = np.ones((144,))*99.99 
          t = np.ones((144,))*99.99 
        else: 
         u = np.ones((144,))*99.99 
         v = np.ones((144,))*99.99 
         wspd = np.ones((144,))*99.99 
         w = np.ones((144,))*99.99 
         p = np.ones((144,))*99.99 
         t = np.ones((144,))*99.99 
         counter=counter+1 
        predictors_wrf.append(u) 
        predictors_wrf.append(v) 
        predictors_wrf.append(wspd) 
        predictors_wrf.append(w) 
        predictors_wrf.append(p) 
        predictors_wrf.append(t) 
        u_names = 'u_'+str(k)+'_'+str(j)+'_'+str(i) 
        v_names = 'v_'+str(k)+'_'+str(j)+'_'+str(i) 
        wspd_names = 'wspd_'+str(k)+'_'+str(j)+'_'+str(i) 
        w_names = 'w_'+str(k)+'_'+str(j)+'_'+str(i) 
        p_names = 'p_'+str(k)+'_'+str(j)+'_'+str(i) 
        t_names = 't_'+str(k)+'_'+str(j)+'_'+str(i) 
        names_wrf.append(u_names) 
        names_wrf.append(v_names) 
        names_wrf.append(wspd_names) 
        names_wrf.append(w_names) 
        names_wrf.append(p_names) 
        names_wrf.append(t_names) 
    cdate+=inc
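
On the string-building point: the names depend only on the index ranges, not on the file or the date, so one option is to build them once before the date loop instead of once per file. A minimal sketch, assuming the same `u_k_j_i` naming and the same loop order (i outermost, k innermost) as the code above; `build_names` and `VARIABLES` are hypothetical names introduced for illustration:

```python
from itertools import product

# Per-point variable order used by the append calls in the loop above.
VARIABLES = ("u", "v", "wspd", "w", "p", "t")


def build_names(k_space, j_space, i_space, variables=VARIABLES):
    """Return names like 'u_0_80_200' in the same order the nested loops produce them."""
    return [
        "{}_{}_{}_{}".format(var, k, j, i)
        # product() varies its last argument fastest, so k is the innermost index.
        for i, j, k in product(i_space, j_space, k_space)
        for var in variables
    ]


names_wrf = build_names(range(0, 37), range(80, 170), range(200, 307))
```

Whether this matters much compared with the file I/O is something only profiling can confirm.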

2017-02-22 04:38:36 Selecsosi

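I don't have a lot of suggestions, but here are a few things to note.

Don't open the file so many times

First, you define this filename variable, and then inside this loop (deep inside: three for loops deep) you check whether the file exists and presumably open it there (I don't know what nc.Dataset does, but I'm guessing it has to open the file and read it):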

filename = wrf_path+'\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00' 
    for i in i_space: 
     for j in j_space: 
      for k in k_space: 
        if os.path.isfile(filename): 
         f = nc.Dataset(filename,'r')

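This is going to be very inefficient. If the file does not change while the loops run, you can certainly open it once, before all of the loops.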
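
To put a rough number on this, here is a small count based only on the index ranges in the question. It assumes a complete file, in which `Times`, `U`, `V`, `W`, `P` and `T` are each read at every grid point:

```python
# Grid sizes taken from the np.arange calls in the question.
n_k = len(range(0, 37))     # k_space
n_j = len(range(80, 170))   # j_space
n_i = len(range(200, 307))  # i_space

grid_points = n_k * n_j * n_i

# The original loop calls nc.Dataset(...) once per grid point ...
opens_per_file = grid_points
# ... and slices Times, U, V, W, P and T at every grid point.
reads_per_file = grid_points * 6

print(grid_points, reads_per_file)  # 356310 2137860
```

So each file is opened about 356 thousand times and sliced about 2.1 million times, which is a plausible explanation for a run time measured in hours.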

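Try to use fewer for loops

All of these nested for loops multiply the number of operations you have to perform. General advice: try to use numpy operations instead.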
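
As an illustration of the numpy approach, the whole sub-domain of a variable can be requested with one slice instead of one read per grid point. This is a sketch, not a drop-in replacement: `read_block` is a hypothetical helper, and `variables` is assumed to behave like the `f.variables` mapping of an open `nc.Dataset`, where slicing a variable returns an array.

```python
import numpy as np


def read_block(variables, k_slice, j_slice, i_slice):
    """Read each variable's sub-domain with a single slice.

    Returns a dict of arrays shaped (time, k, j, i). The leading [1:]
    skips the first time step, as in the question's code.
    """
    block = {}
    for name in ("U", "V", "W", "P", "T"):
        block[name.lower()] = variables[name][1:, k_slice, j_slice, i_slice]
    # Wind speed for every time step and grid point in one vectorized call.
    block["wspd"] = np.sqrt(block["u"] ** 2 + block["v"] ** 2)
    return block
```

With the question's ranges the call would look like `read_block(f.variables, slice(0, 37), slice(80, 170), slice(200, 307))`, which is five reads per file. The result is large, so memory use is worth checking before running it on the full domain.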

Use cProfile

If you want to know why your programs take a long time, one of the best ways to find out is to profile them.
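
A minimal sketch of profiling a single call with the standard library's `cProfile` and `pstats` modules; `profile_call` is a hypothetical helper name, and the function you pass in would be whatever wraps your per-file work:

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and keep the ten most expensive entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()
```

For a whole script, `python -m cProfile -s cumulative your_script.py` gives a similar report without changing the code. Profiling a single file over a reduced index range is usually enough to see where the time goes.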

2017-02-22 04:29:54 erewok

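For your problem, I think multiprocessing will help a lot. I went through your code and have a few suggestions.

Instead of iterating over the start time, iterate over the filenames in your code.

Wrap the date logic in a function that finds all the filenames for the time range and returns them as a list.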

def fileNames(start_date, end_date): 
    # Find all filenames. 
    cdate = start_date 
    fileNameList = [] 
    while cdate <= end_date: 
     if cdate.month not in month_keep: 
      cdate+=inc 
      continue 
     yy = cdate.strftime('%Y')   
     mm = cdate.strftime('%m') 
     dd = cdate.strftime('%d') 
     filename = wrf_path+'\wrfoutRED_d01_'+yy+'-'+mm+'-'+dd+'_'+hour_str+'_00_00' 
     fileNameList.append(filename) 
     cdate+=inc 

    return fileNameList
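
The function above relies on the module-level names `month_keep`, `inc`, `wrf_path` and `hour_str`. A variant that takes them as parameters can be easier to test and reuse. This is a sketch under the assumption that the dates are `datetime` objects and `inc` is a `timedelta`; `file_names` is a hypothetical name, and `os.path.join` replaces the hard-coded backslash:

```python
import datetime
import os


def file_names(start_date, end_date, inc, month_keep, wrf_path, hour_str):
    """List the expected WRF output paths between two dates, keeping selected months."""
    names = []
    cdate = start_date
    while cdate <= end_date:
        if cdate.month in month_keep:
            stem = "wrfoutRED_d01_{}_{}_00_00".format(
                cdate.strftime("%Y-%m-%d"), hour_str
            )
            names.append(os.path.join(wrf_path, stem))
        cdate += inc
    return names


# Example values below are placeholders, not taken from the question.
example = file_names(
    datetime.date(2017, 1, 30),
    datetime.date(2017, 2, 2),
    datetime.timedelta(days=1),
    month_keep=[1],
    wrf_path="wrf_data",
    hour_str="00",
)
```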

Wrap the code that extracts your data and fills in 99.99 in a function whose input is a filename.

def dataExtraction(filename):
    # Collect results locally so that each worker process returns its own data.
    # (The question's global `counter` is dropped: a worker cannot update it.)
    predictors_wrf = []
    names_wrf = []
    f = None
    complete = False
    if os.path.isfile(filename):
        # Open the file once per call, not once per grid point.
        f = nc.Dataset(filename,'r')
        num_lines = f.variables['Times'][1:].shape[0]
        complete = (num_lines == 144)
        if not complete:
            print("partial files for WRF: "+ filename)
    for i in i_space:
        for j in j_space:
            for k in k_space:
                if complete:
                    u = f.variables['U'][1:,k,j,i]
                    v = f.variables['V'][1:,k,j,i]
                    wspd = np.sqrt(u**2.+v**2.)
                    w = f.variables['W'][1:,k,j,i]
                    p = f.variables['P'][1:,k,j,i]
                    t = f.variables['T'][1:,k,j,i]
                else:
                    # Missing or partial file: fill every variable with 99.99.
                    u, v, wspd, w, p, t = [np.ones((144,))*99.99 for _ in range(6)]
                predictors_wrf.extend([u, v, wspd, w, p, t])
                suffix = '_'+str(k)+'_'+str(j)+'_'+str(i)
                names_wrf.extend([var+suffix for var in ('u','v','wspd','w','p','t')])
    if f is not None:
        f.close()

    # Return a list so that the result can be sent back from the worker process.
    return list(zip(predictors_wrf, names_wrf))
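
A vectorized variant of the extraction step could build one 2-D array per file without Python loops over grid points. The sketch below is an assumption about a convenient layout, not the original answer's method: `to_feature_matrix` is a hypothetical helper, and `block` is assumed to be a dict of arrays shaped (time, k, j, i), one per variable, or `None` when the file is missing or partial.

```python
import numpy as np

# Column order per grid point, matching the append order in the loops.
VAR_ORDER = ("u", "v", "wspd", "w", "p", "t")


def to_feature_matrix(block, shape_kji, n_times=144, fill_value=99.99):
    """Return an array of shape (n_times, n_features) for one file.

    Columns follow the original loop order: i outermost, then j, then k,
    then the six variables. A missing block yields an all-fill matrix.
    """
    nk, nj, ni = shape_kji
    n_features = len(VAR_ORDER) * nk * nj * ni
    if block is None:
        return np.full((n_times, n_features), fill_value)
    # Stack the variables on a new last axis: (time, k, j, i, var).
    stacked = np.stack([block[name] for name in VAR_ORDER], axis=-1)
    # Reorder to (time, i, j, k, var), then flatten everything but time.
    # reshape raises if the block does not have exactly n_times time steps.
    return stacked.transpose(0, 3, 2, 1, 4).reshape(n_times, n_features)
```

For the question's full domain this is 144 rows by 2,137,860 columns, about 308 million values per file, so it may be necessary to process the domain in chunks or keep only the columns that are needed.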

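Use multiprocessing to do the work. Generally, all computers have more than one CPU core, and when the job involves a lot of CPU computation, multiprocessing helps increase the speed. In my previous experience, multiprocessing reduced the time spent on huge datasets by about 2/3.

Update: after testing my code against my own files again on Feb. 25, 2017, I found that using 8 cores on a huge dataset saved me about 90% of the elapsed time.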
```
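from multiprocessing import Pool
import datetime

if __name__ == '__main__':
     # fileNames() uses cdate.month and cdate+=inc, so the dates are
     # assumed to be datetime objects rather than strings.
     start_date = datetime.datetime(2017, 1, 1)
     end_date = datetime.datetime(2017, 1, 15)
     filenames = fileNames(start_date, end_date) # do not rebind the function name
     p = Pool(4) # number of worker processes (cores) you want to use
     results = p.map(dataExtraction, filenames)
     p.close()
     p.join()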
```
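Finally, pay attention to the data structure here, because it is fairly complicated. Hope this helps. Please leave a comment if you have further questions.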
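
On the data-structure point: `Pool.map` returns one result per input file, in the same order as the input list. If each worker returned a 2-D array of shape (time steps, features), which is an assumption and not what `dataExtraction` returns as written, the per-file results could be joined along the time axis. `combine_results` is a hypothetical helper:

```python
import numpy as np


def combine_results(per_file_matrices):
    """Join per-file (time, features) arrays into one array along the time axis."""
    matrices = list(per_file_matrices)
    if not matrices:
        raise ValueError("no results to combine")
    n_features = matrices[0].shape[1]
    for index, matrix in enumerate(matrices):
        # Every file must describe the same set of columns.
        if matrix.ndim != 2 or matrix.shape[1] != n_features:
            raise ValueError("result {} has unexpected shape {}".format(index, matrix.shape))
    return np.concatenate(matrices, axis=0)
```

Keeping the column names in a single separate list, rather than pairing a name with every array, avoids sending millions of duplicate strings back from each worker.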

2017-02-22 15:55:48
