Python: iterating quickly through an np.array

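I have a 1-D np array with over 150 million data points; it is filled from a binary data file using np.fromfile.

Given that array, I need to add a value 'val' to each point unless that point equals 'x'.

In addition, each value in the array (depending on its value) corresponds to another value that I want to store in another list.

Explanation of variables: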

**temps = np.arange(-30.00, 0.01, 0.01, dtype='float32')

**slr is a list; index 0 in temps corresponds to index 0 in slr, and so on. Both lists have the same length.

Here is my current code:

import sys 
import numpy as np 

with open("file.dat", "rb") as f:
    array = np.fromfile(f, dtype=np.float32)
f.close()

#This is the process below that I need to speed up

T_SLR = np.array(np.zeros(len(array), dtype='float64'))
for i in range(0,len(array)):
    if array[i] != float(-9.99e+08):
        array[i] = array[i] - 273.15
    if array[i] in temps:
        index, = np.where(temps==array[i])[0]
        T_SLR[i] = slr[index]
    else:
        T_SLR[i] = 0.00

Source

2015-12-03 user2938093

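It looks like your sensor may only return values in 0.01-degree increments. Is that right? And if so, is 'temps' chosen so that it covers every temperature between -30 and 0, or do you really want the samples that do not fall on an exact hundredth to go into 'T_SLR'? –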

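Yes, temps should cover -30 to 0 in steps of 0.01. Each temperature value there corresponds to an slr value in the list slr. T_SLR is a new list (it will have the same length as 'array'). The values of array are compared against temps, and if a value is found in temps, its index is taken. That index is used to pull the value from slr, which is then appended to T_SLR – user2938093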

The slowest point in your code is the O(n) traversal of the list:

if array[i] in temps: 
    index, = np.where(temps==array[i])[0]

Since temps is not very large, you can convert it to a dict:

temps2 = dict(zip(temps, range(len(temps))))

which makes the lookup O(1):

if array[i] in temps2: 
    index = temps2[array[i]]
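As a rough illustration of the dict idea, here is a minimal sketch. The `slr` values and the `lookup` helper are made up for the example and are not part of the question; the keys are converted to plain Python floats so that hashing and equality behave predictably.

```python
import numpy as np

temps = np.arange(-30.00, 0.01, 0.01, dtype='float32')
# Hypothetical slr table with the same length as temps (illustration only)
slr = np.linspace(5.0, 20.0, temps.size)

# Map each temperature (as a Python float) to its position in temps
temps2 = dict(zip(temps.tolist(), range(len(temps))))

def lookup(value):
    """Return the slr entry for an exact temperature match, else 0.0."""
    index = temps2.get(float(value))   # O(1) average-case hash lookup
    return slr[index] if index is not None else 0.0
```

This still relies on exact floating-point equality, so a measurement only matches if it is bit-for-bit the same value as an entry of temps.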

You can also try to avoid the for loop to speed things up. For example, the following code:

for i in range(0,len(array)): 
    if array[i] != float(-9.99e+08): 
     array[i] = array[i] - 273.15

can be written as:

array[array!=float(-9.99e+08)] -= 273.15

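Another problem in your code is the floating-point comparison. You should not use the exact-equality operators == or !=; try numpy.isclose, or convert the floats to int.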
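A small sketch of the two alternatives to exact equality. The tolerance of 1e-4 and the choice of hundredths of a degree as the integer unit are assumptions made for this example.

```python
import numpy as np

temps = np.arange(-30.00, 0.01, 0.01, dtype='float32')

# Tolerant comparison: explicit absolute tolerance, no relative tolerance
close = np.isclose(temps, -29.99, rtol=0.0, atol=1e-4)

# Integer comparison: scale to hundredths of a degree, round, then compare
hundredths = np.round(temps * 100).astype(np.int64)
as_int = hundredths == -2999

print(np.flatnonzero(close), np.flatnonzero(as_int))
```

Both tests pick out the same single entry of temps without depending on how -29.99 happens to be rounded in float32.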

Source

2015-12-03 02:08:46 eph

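Since your selection criteria appear to be pointwise, there is no reason to read all 150 million points at once. You can use the count parameter of np.fromfile to limit the size of the array you compare at one time. Once you are processing chunks larger than a few thousand points, the cost of the for loop will not matter, and you will not be eating up memory with a giant array of all 150 million points.

slr and temps look like index translation tables. You can replace the search over temps with a floating-point comparison and a computed lookup. Since -9.99e+8 is clearly outside the search criteria, you do not need any special handling for those points.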

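import os
import numpy as np

f = open("file.dat", "rb")
N = 10000  # number of float32 values read per chunk
size_of_TMPprs = os.fstat(f.fileno()).st_size  # file size in bytes
T_SLR = np.zeros(size_of_TMPprs // 4, dtype=np.float64)  # 4 bytes per float32
t_off = 0
array = np.fromfile(f, count=N, dtype=np.float32)
while array.size > 0:
    array -= 273.15
    index = np.where((array >= -30) & (array <= 0))[0]
    # slr is assumed to be a NumPy array; the computed index must be an integer array
    T_SLR[t_off+index] = slr[np.round((array[index]+30)*100).astype(int)]
    t_off += array.size
    array = np.fromfile(f, count=N, dtype=np.float32)
f.close()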

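If you want T_SLR to contain the last entry in slr when the measured value exceeds zero, you can simplify this even further. In that case you can use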

array = np.maximum(np.minimum(array, 0), -30)

to limit the range of the values in array, and then simply use it to compute the index into slr as above (without the where in this case).
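A minimal sketch of that clipped variant, assuming slr has 3001 entries covering -30 to 0 in steps of 0.01. The function name `slr_by_clipped_index` is invented for this example.

```python
import numpy as np

def slr_by_clipped_index(kelvin, slr):
    """Look up slr for every reading, clamping temperatures to [-30, 0]."""
    celsius = np.asarray(kelvin, dtype=np.float64) - 273.15
    clipped = np.maximum(np.minimum(celsius, 0), -30)
    # Computed lookup: -30 maps to index 0, 0 maps to index 3000
    index = np.round((clipped + 30) * 100).astype(int)
    return np.asarray(slr)[index]

# Hypothetical table whose values equal their own index, for easy checking
slr = np.arange(3001, dtype=np.float64)
print(slr_by_clipped_index([243.15, 253.15, 273.15, 300.0], slr))
```

Note that with clipping, the -9.99e+08 fill value is clamped to -30 and therefore picks up the first entry of slr, so those points need a separate mask if they should stay at zero.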

Source

2015-12-03 02:47:00

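I used "os.fstat(f.fileno()).st_size" in place of size_of_TMPprs, but I get the following error: TypeError: only length-1 arrays can be converted to Python scalars, raised by T_SLR[t_off+index] = slr[int((array[index]+30)*100)] – user2938093

Sorry! int() should be np.round(), which returns an array value that can be used for indexing T_SLR. I changed it in the answer. –

I also noticed that a float32 has 4 bytes, not 32 as I originally calculated. The answer has been changed for that as well. –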

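When using with open, do not close the file yourself. The with context does that automatically. I also changed the generic name array to something with less risk of shadowing something else (such as np.array?)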

with open("file.dat", "rb") as f: 
    data = np.fromfile(f, dtype=np.float32)

First, there is no need to wrap np.zeros in np.array. It is already an array. len(data) is fine if data is 1-D, but I prefer to work with the shape tuple.

T_SLR = np.zeros(data.shape, dtype='float64')

Boolean indexing/masking lets you operate on the whole array at once:

mask = data != -9.99e8  # don't need `float` here;
                        # using a != test with floats is a poor idea
data[mask] -= 273.15

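I need to refine the != test. It is fine for integers, but not for floats. Something like np.abs(data+9.99e8)>1 is better.

Likewise, in is not a good test with floats. And with integers, using in followed by where does redundant work.

Assuming temps is 1-D, np.where(...) returns a 1-element tuple. [0] selects that element, returning an array. The , in index, is then redundant. index, = np.where() without the [0] should have worked.

T_SLR[i] is already 0 from the initialization of the array. There is no need to set it again.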
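For illustration, a small sketch of the tolerant fill-value test on made-up sample data. The name `FILL` and the sample values are assumptions for this example only.

```python
import numpy as np

FILL = -9.99e8  # missing-data marker used in the question

# Made-up readings in kelvin, with two missing points
data = np.array([FILL, 273.15, 250.0, FILL, 260.5], dtype=np.float32)

# True where the value is a real measurement rather than the fill value
mask = np.abs(data - FILL) > 1
data[mask] -= 273.15  # convert only the real measurements to Celsius
```

The threshold of 1 is far smaller than the distance between the fill value and any plausible temperature, so small rounding differences in the stored marker do not matter.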

for i in range(0,len(array)):
    if array[i] in temps:
        index, = np.where(temps==array[i])[0]
        T_SLR[i] = slr[index]
    else:
        T_SLR[i] = 0.00

But I think we can get rid of this iteration as well. I will come back to that later.

In [461]: temps=np.arange(-30.00,0.01,0.01, dtype='float32') 
In [462]: temps 
Out[462]: 
array([ -3.00000000e+01, -2.99899998e+01, -2.99799995e+01, ..., 
     -1.93138123e-02, -9.31358337e-03, 6.86645508e-04], dtype=float32) 
In [463]: temps.shape 
Out[463]: (3001,)

No wonder array[i] in temps and np.where(temps==array[i]) are slow.

We can cut out the in by taking a look at where:

In [464]: np.where(temps==12.34) 
Out[464]: (array([], dtype=int32),) 
In [465]: np.where(temps==temps[3]) 
Out[465]: (array([3], dtype=int32),)

If there is no match, where returns an empty array.

In [466]: idx,=np.where(temps==temps[3]) 
In [467]: idx.shape 
Out[467]: (1,) 
In [468]: idx,=np.where(temps==123.34) 
In [469]: idx.shape 
Out[469]: (0,)

in can be faster than where if the match is early in the list, but it is as slow, if not slower, when the match is at the end or there is no match.

In [478]: timeit np.where(temps==temps[-1])[0].shape[0]>0 
10000 loops, best of 3: 35.6 µs per loop 
In [479]: timeit temps[-1] in temps 
10000 loops, best of 3: 39.9 µs per loop

A rounding approach:

In [487]: (np.round(temps,2)/.01).astype(int) 
Out[487]: array([-3000, -2999, -2998, ..., -2, -1,  0])

My suggested adjustment:

T_SLR = -(np.round(data, 2)/.01).astype(int)
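One way the rounding idea could be carried through to a loop-free lookup is sketched below. This is an assumption about where the suggestion was heading, not the answerer's code: the function name `lookup_slr` is invented, and slr is assumed to have 3001 entries for -30.00 to 0.00 in steps of 0.01.

```python
import numpy as np

def lookup_slr(data, slr, fill=-9.99e8):
    """Vectorized lookup: kelvin readings -> slr values, 0 where out of range."""
    data = np.asarray(data, dtype=np.float64)
    valid = np.abs(data - fill) > 1                  # skip missing-data markers
    hundredths = np.round((data - 273.15) * 100)     # Celsius in integer steps of 0.01
    in_range = valid & (hundredths >= -3000) & (hundredths <= 0)
    out = np.zeros(data.shape, dtype=np.float64)
    # -3000 hundredths (-30.00 degrees) maps to index 0, 0 maps to index 3000
    out[in_range] = np.asarray(slr)[(hundredths[in_range] + 3000).astype(int)]
    return out

# Hypothetical table, offset by 1 so a real match is never confused with 0
slr = np.arange(3001, dtype=np.float64) + 1.0
print(lookup_slr([243.15, 273.15, 300.0, -9.99e8, 263.15], slr))
```

Because every reading is rounded to the nearest hundredth, this matches values that the exact in test would reject, which may or may not be what is wanted.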

Source

2015-12-03 03:44:40 hpaulj

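Hi, thanks. I have incorporated your changes and I appreciate the detailed reply. However, it is the for loop that I need to eliminate. Iterating through every index of the 'data' array is very slow and often crashes the kernel. Any help with that would be much appreciated. – user2938093

Look at the shape of 'temps'. It is large. We need to think of a better way to test, or to map the 'data' values onto 'indices'. – hpaulj

The temps shape is (3001,0). For reference, this is starting to teeter on the edge of my Python knowledge. Because I have been working with smaller files, I have been able to get by with the crude approach above. – user2938093

Since temps is sorted, you can use np.searchsorted and avoid any explicit loop: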

array[array != float(-9.99e+08)] -= 273.15
indices = np.searchsorted(temps, array)
# Remove indices out of bounds
mask = indices < temps.shape[0]
# Remove in-bounds indices not matching exactly
mask[mask] &= temps[indices[mask]] == array[mask]
# Fill matches from slr (assumed to be a NumPy array); everything else stays 0
T_SLR = np.zeros(array.shape, dtype=np.float64)
T_SLR[mask] = slr[indices[mask]]
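To see how the np.searchsorted approach behaves, here is a tiny worked example with made-up tables. The four-entry temps and slr arrays and the `values` array are illustrative only.

```python
import numpy as np

temps = np.array([-30.0, -20.0, -10.0, 0.0], dtype=np.float32)  # sorted keys
slr = np.array([5.0, 10.0, 15.0, 20.0])                         # hypothetical values
values = np.array([-20.0, -15.0, 0.0, 40.0], dtype=np.float32)

indices = np.searchsorted(temps, values)             # insertion points in 0..len(temps)
mask = indices < temps.shape[0]                      # drop points past the last key
mask[mask] &= temps[indices[mask]] == values[mask]   # keep exact matches only

result = np.zeros(values.shape)
result[mask] = slr[indices[mask]]
print(result)  # [10.  0. 20.  0.]
```

Here -15 falls between two keys and 40 lies past the end of the table, so both stay at 0, while -20 and 0 pick up their slr entries.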

Source

2015-12-03 05:38:11 Jaime
