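I need some help optimizing Python code

I'm writing a KNN classifier in Python, but I have a performance problem. The piece of code below takes 7.5 s to 9.0 s to complete, and I have to run it 60,000 times. I'd appreciate some help optimizing it.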

for fold in folds:
    for dot2 in fold:
        # distances[x][0] = class of dot2
        # distances[x][1] = distance between dot1 and dot2
        distances.append([dot2[0], calc_distance(dot1[1:], dot2[1:], method)])

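The `folds` variable is a list of 10 folds which together contain 60,000 inputs: images stored in .csv format. The first value of each dot is the class it belongs to. All the values are integers. Is there a way to make this line run any faster?

This is the calc_distance function: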

import math

def calc_distance(dot1, dot2, distance):
    # dot1 and dot2 are equal-length coordinate sequences (class label already removed).
    if distance == "manhanttan":
        # Sum of the absolute coordinate differences.
        total = 0
        for x in range(len(dot1)):
            total += abs(dot1[x] - dot2[x])
        return total

    elif distance == "euclidiana":
        # Square root of the sum of squared differences.
        total = 0
        for x in range(len(dot1)):
            total += (dot1[x] - dot2[x]) ** 2
        return math.sqrt(total)

    elif distance == "supremum":
        # Largest absolute coordinate difference.
        total = 0
        for x in range(len(dot1)):
            diff = abs(dot1[x] - dot2[x])
            if diff > total:
                total = diff
        return total

    elif distance == "cosseno":
        # Cosine of the angle between the two vectors
        # (a similarity: larger means closer).
        p1_p2_mul = 0
        p1_sum = 0
        p2_sum = 0
        for x in range(len(dot1)):
            p1_p2_mul += dot1[x] * dot2[x]
            p1_sum += dot1[x] ** 2
            p2_sum += dot2[x] ** 2
        quociente = math.sqrt(p1_sum) * math.sqrt(p2_sum)
        return p1_p2_mul / quociente

EDIT: I found a way to make it faster, at least for the "manhanttan" method. Instead of:

if distance == "manhanttan":
    total = 0
    # for each coord, take the absolute difference
    for x in range(0, len(dot1)):
        total = total + abs(dot1[x] - dot2[x])
    return total

I put:

if distance == "manhanttan":
    totalp1 = 0
    totalp2 = 0
    # sum each dot's coordinates, then take one absolute difference
    for x in range(0, len(dot1)):
        totalp1 += dot1[x]
        totalp2 += dot2[x]

    return abs(totalp1 - totalp2)

The abs() call is very heavy.
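One caution about the edit above: the absolute difference of the two coordinate sums is not, in general, the same quantity as the sum of the per-coordinate absolute differences, so the faster version may change the classification results. A minimal check, using made-up two-coordinate points and hypothetical function names chosen only for this illustration:

```python
def manhattan(dot1, dot2):
    # Original definition: sum of per-coordinate absolute differences.
    return sum(abs(a - b) for a, b in zip(dot1, dot2))


def abs_of_sum_difference(dot1, dot2):
    # What the edited version computes: |sum(dot1) - sum(dot2)|.
    return abs(sum(dot1) - sum(dot2))


# Hypothetical points, for illustration only.
p, q = [0, 2], [2, 0]
print(manhattan(p, q))              # 4
print(abs_of_sum_difference(p, q))  # 0
```

The two functions agree only when every coordinate difference has the same sign, which is unlikely to hold for arbitrary image data.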

2014-10-27 Victor

Here are some links that might help: https://wiki.python.org/moin/PythonSpeed/PerformanceTips http://nbviewer.ipython.org/github/rasbt/python_reference/blob/master/benchmarks/timeit_tests.ipynb?create=1#string_operations – Totem 2014-10-27 21:39:39

Please edit your post to include the whole code. Also include the input (or at least part of it). – 2014-10-27 21:40:04

*"Help optimizing Python code"* is not an on-topic question here. – jonrsharpe 2014-10-27 21:40:05

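There are many guides to "profiling Python"; you should search for some, read them, and go through the profiling process so that you know which parts of your work take the most time.

But if this really is the core of your work, it's a fair bet that calc_distance is where most of the running time is consumed.

Optimizing it deeply will probably require NumPy's accelerated math or a similar lower-level approach.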
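As a rough sketch of what the NumPy route could look like, the function below computes the distance from one dot to every dot of a fold in a single vectorized step. The name `distances_to_all` is hypothetical, the sketch assumes NumPy is installed and that column 0 of every row holds the class label as in the question, and it covers only three of the four methods.

```python
import numpy as np


def distances_to_all(dot1, dots, method):
    """Distance from dot1 to every row of dots; column 0 is the class label."""
    a = np.asarray(dot1, dtype=np.float64)[1:]
    b = np.asarray(dots, dtype=np.float64)[:, 1:]
    diff = b - a  # broadcasting: one row of coordinate differences per dot
    if method == "manhanttan":
        return np.abs(diff).sum(axis=1)
    if method == "euclidiana":
        return np.sqrt((diff ** 2).sum(axis=1))
    if method == "supremum":
        return np.abs(diff).max(axis=1)
    raise ValueError("unknown method: %r" % method)


# Made-up data: label first, then three coordinates.
fold = [[0, 1, 2, 3], [1, 4, 6, 3]]
dot1 = [0, 1, 1, 1]
print(distances_to_all(dot1, fold, "manhanttan"))
```

For real use it would likely pay to convert each fold to an array once, outside the loop over test dots, since the conversion itself is not free.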

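As a quick-and-dirty approach that needs less invasive profiling and rewriting, try installing the PyPy implementation of Python and running your script under it. I've seen easy 2x or greater speedups compared with the standard (CPython) implementation.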

2014-10-27 22:04:05

I'm confused. Have you tried a profiler?

python -m cProfile myscript.py

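It will show you where most of the time is being spent and give you hard data to work with: for example, refactoring to reduce the number of calls, restructuring the input data, or substituting one function for another.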
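The profiler can also be driven from inside a script, which is handy for measuring just one loop. The sketch below uses the standard `cProfile` and `pstats` modules; `slow_sum` is a stand-in workload invented for this example and should be replaced by the real classification loop.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    # Stand-in workload; replace with the real classification loop.
    total = 0
    for i in range(n):
        total += abs(i - n // 2)
    return total


profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100000)
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 5 entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

The `ncalls` column of the report is a good place to start here: a function called many millions of times, as calc_distance would be, tends to dominate the cumulative time.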

https://docs.python.org/3/library/profile.html

2014-10-27 22:18:06

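I talked to my teacher and he said the timing is right; it is expected to take a long time. I used these suggestions and they helped me a lot. The function "calc_distance" takes a long time to run. I will try to make it faster. – Victor 2014-10-28 02:10:59

You can improve it a lot by using numpy arrays. – badc0re 2014-10-28 07:34:53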

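First, you should avoid a single calc_distance function that performs a linear search through a list of strings (the if/elif chain) on every call. Define separate distance functions and call the right one. As Lee Daniel Crocker suggested, don't use slicing; just start your loop range at 1.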
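A minimal sketch of that restructuring follows. The function names, the `DISTANCES` table, and `distances_to_fold` are all hypothetical; the method strings are the ones used in the question, and only three of the four methods are shown.

```python
import math


def manhattan(dot1, dot2):
    total = 0
    for x in range(1, len(dot1)):  # index 0 is the class label, so skip it
        total += abs(dot1[x] - dot2[x])
    return total


def euclidean(dot1, dot2):
    total = 0
    for x in range(1, len(dot1)):
        total += (dot1[x] - dot2[x]) ** 2
    return math.sqrt(total)


def supremum(dot1, dot2):
    best = 0
    for x in range(1, len(dot1)):
        diff = abs(dot1[x] - dot2[x])
        if diff > best:
            best = diff
    return best


# Map each method name to its function; looked up once per fold, not per pair.
DISTANCES = {
    "manhanttan": manhattan,
    "euclidiana": euclidean,
    "supremum": supremum,
}


def distances_to_fold(dot1, fold, method):
    dist = DISTANCES[method]
    # Pass whole dots: no dot[1:] slices are created inside the loop.
    return [[dot2[0], dist(dot1, dot2)] for dot2 in fold]


print(distances_to_fold([0, 1, 1, 1], [[0, 1, 2, 3], [1, 4, 6, 3]], "manhanttan"))
```

Resolving the function once moves the string comparisons out of the inner loop, and passing whole dots avoids copying two lists for every distance computed.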

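For the cosine distance, I'd recommend normalizing all the dot vectors once, up front. That way the distance computation reduces to a dot product.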
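A short sketch of the normalization idea, with hypothetical helper names and made-up data. It assumes index 0 holds the class label, as in the question, and it treats an all-zero vector as a degenerate case that is left unchanged.

```python
import math


def normalize(dot):
    # Keep the class label at index 0; scale the coordinates to unit length.
    norm = math.sqrt(sum(v * v for v in dot[1:]))
    if norm == 0:
        return list(dot)  # assumption: leave an all-zero vector unchanged
    return [dot[0]] + [v / norm for v in dot[1:]]


def cosine_similarity_unit(dot1, dot2):
    # For unit-length vectors the cosine is just the dot product.
    total = 0.0
    for x in range(1, len(dot1)):
        total += dot1[x] * dot2[x]
    return total


# Made-up dots: label first, then the coordinates.
a = normalize([0, 3, 4])
b = normalize([1, 6, 8])
c = normalize([1, 4, -3])
print(cosine_similarity_unit(a, b))  # same direction: close to 1
print(cosine_similarity_unit(a, c))  # orthogonal: close to 0
```

Each dot is normalized once, so the two square roots and the division disappear from the inner loop.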

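These micro-optimizations can give you some speedup. But you should get a much better gain by switching to a better algorithm: a kNN classifier calls for a kD-tree, which lets you quickly eliminate a large fraction of the points from consideration.

This is harder to implement (you have to adapt it slightly for the different distances, and the cosine distance will make it tricky).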
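To give an idea of the shape of such a structure, here is a compact pure-Python kD-tree sketch. It is an illustration under stated assumptions rather than a tuned implementation: it handles only the Euclidean distance, stores each point as a hypothetical `(label, coords)` pair, and the names `build_kdtree` and `knn` are invented for this example.

```python
import heapq


def build_kdtree(points, depth=0):
    """points: list of (label, coords) pairs. Returns nested tuples or None."""
    if not points:
        return None
    axis = depth % len(points[0][1])  # cycle through the coordinate axes
    points = sorted(points, key=lambda p: p[1][axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))


def knn(tree, target, k):
    """Return the k nearest (squared_distance, label) pairs, closest first."""
    heap = []  # max-heap of the best k so far, via negated squared distances

    def visit(node):
        if node is None:
            return
        (label, coords), axis, left, right = node
        d2 = sum((a - b) ** 2 for a, b in zip(coords, target))
        if len(heap) < k:
            heapq.heappush(heap, (-d2, label))
        elif d2 < -heap[0][0]:
            heapq.heapreplace(heap, (-d2, label))
        delta = target[axis] - coords[axis]
        near, far = (left, right) if delta < 0 else (right, left)
        visit(near)
        # Cross the splitting plane only if a closer point could lie beyond it.
        if len(heap) < k or delta * delta < -heap[0][0]:
            visit(far)

    visit(tree)
    return sorted((-neg, label) for neg, label in heap)


# Made-up 2-D points: (label, coords).
points = [(0, (1, 1)), (0, (2, 2)), (1, (8, 8)), (1, (9, 9)), (0, (1, 2))]
tree = build_kdtree(points)
print(knn(tree, (0, 0), 3))  # [(2, 0), (5, 0), (8, 0)]
```

The pruning test is what saves work: a whole subtree is skipped whenever the splitting plane is farther from the target than the current k-th best point. The Manhattan and supremum distances would need the same test expressed in their own metric, which is the adaptation mentioned above.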

2014-10-28 08:50:31
