關於在我的代碼中加快/ if語句的建議？

我的代碼需要大約兩個小時來處理。瓶頸是for循環，如果聲明（請參閱代碼中的註釋）。我是初學者與蟒蛇:)任何人都可以推薦一種高效的Python方法來替換嵌套和if語句？關於在我的代碼中加快/ if語句的建議？

我有〜3000萬行的表，每一行與（X，Y，Z）值：

20.0 11.3 7
21.0 11.3 0
22.0 11.3 3
...

我所需的輸出是x，y，min（z），count（min（z））形式的表格。最後的列是該（x，y）處最小z值的最終計數。例如：

20.0 11.3 7 7
21.0 11.3 0 10
22.0 11.3 3 1
...

這裏只有大約600獨特的座標，所以輸出表將被600x4。我的代碼：

import numpy as np 
file = open('input.txt','r'); 

coordset = set() 
data = np.zeros((600,4))*np.nan 
irow = 0 
ctr = 0 

for row in file: 
    item = row.split() 
    x = float(item[0]) 
    y = float(item[1]) 
    z = float(item[2]) 

    # build unique grid of coords 
    if ((x,y)) not in coordset: 
     data[irow][0] = x 
     data[irow][1] = y 
     data[irow][2] = z 
     irow = irow + 1  # grows up to 599 

    # lookup table of unique coords 
    coordset.add((x,y)) 

    # BOTTLENECK. replace ifs? for? 
    for i in range(0, irow): 
     if data[i][0]==x and data[i][1]==y: 
      if z > data[i][2]: 
       continue 
      elif z==data[i][2]: 
       ctr = ctr + 1 
       data[i][3]=ctr 
      if z < data[i][2]: 
       data[i][2] = z 
       ctr = 1 
       data[i][3]=ctr

編輯：上供@Joowani的方法計算在1m26s。我原來的方法，相同的電腦，相同的數據文件，106m23s。 編輯2： @Ophion和@Sibster感謝您的建議，我沒有足夠的信用+1 +1有用的答案。

來源

2013-08-19 3150

是3千萬行真的很棒保存在TXT？你不應該看一些更復雜的格式來保存（並讀入）你的數據嗎？此外，我建議儘可能向量化（numpy），因爲這會將for-loops推入numpy，這是C（因此更快） – usethedeathstar

您的解決方案似乎很慢，因爲它會在您更新時通過每個遍歷列表（即數據）。一個更好的方法是使用一個字典，它將O（1）與每個更新的O（n）相對照。

這裏是使用字典我的解決方案：

file = open('input.txt', 'r') #coordinates c = {} for line in file: #items (x, y, z) = (float(n) for n in line.split()) if (x, y) not in c: c[(x, y)] = [z, 1] elif c[(x, y)][0] > z: c[(x, y)][0], c[(x, y)][1] = z, 1 elif c[(x, y)][0] == z: c[(x, y)][1] += 1 for key in c: print("{} {} {} {}".format(key[0], key[1], c[key][0], c[key][1]))

來源

2013-08-19 07:13:23 Joohwan

要添加一些，我找到了一個引用（http://wiki.python.org/moin/TimeComplexity）這解釋了字典操作中的一些效率。 – 3150

爲什麼不改變最後如果elif？

就像現在這樣做，您將在循環的每次迭代中評估z < data[i][2]:。

你甚至可以只用一個人換掉它，因爲你已經檢查if z>data[i][2]和z == data[i][2]所以剩下的唯一可能性是z < data[i][2]:

所以，下面的代碼會做同樣的，應該是更快：

 if z > data[i][2]: 
      continue 
     elif z==data[i][2]: 
      ctr = ctr + 1 
      data[i][3]=ctr 
     else: 
      data[i][2] = z 
      ctr = 1 
      data[i][3]=ctr

來源

2013-08-19 07:00:30 Sibster

要numpy的使用np.unique做到這一點。

def count_unique(arr): 
    row_view=np.ascontiguousarray(a).view(np.dtype((np.void,a.dtype.itemsize * a.shape[1]))) 
    ua, uind = np.unique(row_view,return_inverse=True) 
    unique_rows = ua.view(a.dtype).reshape(ua.shape + (-1,)) 
    count=np.bincount(uind) 
    return np.hstack((unique_rows,count[:,None]))

首先讓檢查了一小陣：

a=np.random.rand(10,3) 
a=np.around(a,0) 

print a 
[[ 0. 0. 0.] 
[ 0. 1. 1.] 
[ 0. 1. 0.] 
[ 1. 0. 0.] 
[ 0. 1. 1.] 
[ 1. 1. 0.] 
[ 1. 0. 1.] 
[ 1. 0. 1.] 
[ 1. 0. 0.] 
[ 0. 0. 0.]] 

print output 
[[ 0. 0. 0. 2.] 
[ 0. 1. 0. 1.] 
[ 0. 1. 1. 2.] 
[ 1. 0. 0. 2.] 
[ 1. 0. 1. 2.] 
[ 1. 1. 0. 1.]] 

print np.sum(output[:,-1]) 
10

看起來不錯！現在，讓我們檢查一個大陣：

a=np.random.rand(3E7,3) 
a=np.around(a,1) 

output=count_unique(a) 
print output.shape 
(1331, 4) #Close as I can get to 600 unique elements. 

print np.sum(output[:,-1]) 
30000000.0

注意到約33第二個我的機器上的3GB內存，所有內存中做這個大型陣列很可能是你的瓶頸。對於參考@ Joowani的解決方案花了大約130秒，雖然這是一個蘋果和橘子比較，因爲我們開始與一個numpy陣列。你的milage可能會有所不同。

要作爲numpy的陣列我想查看的問題here數據的讀取，但它看起來應該像下面這樣：

arr=np.genfromtxt("./input.txt", delimiter=" ")

加載在如此多的數據從一個txt文件，我真的建議使用該鏈接中的pandas示例。

來源

2013-08-19 13:11:50 Daniel

關於在我的代碼中加快/ if語句的建議？

回答

相關問題