Reading and writing numpy arrays in HDF5 files

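I'm building simulation software, and I need to write (thousands of) 2D numpy arrays into tables in an HDF5 file, where one dimension of each array is variable. The incoming array is of type float32; to save disk space, every array is stored as a table with an appropriate data type for each column (which is why I'm not using arrays). When I read a table back, I want to retrieve a numpy.ndarray of type float32, so that I can do calculations for the analysis. Below is example code with an array holding species A, B and C plus time.

The way I currently read and write 'works', but it is very slow. So the question is: what is the appropriate way to store array into table quickly, and to read it back into ndarrays? I've been experimenting with numpy.recarray, but I can't get it to work (type errors, dimension errors, completely wrong numbers, etc.).

Code: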

import tables as pt 
import numpy as np 

# Variable dimension 
var_dim=100 

# Example array, rows 0 and 3 should be stored as float32, rows 1 and 2 as uint16 
array=(np.random.random((4, var_dim)) * 100).astype(dtype=np.float32) 

filename='test.hdf5' 
hdf=pt.open_file(filename=filename,mode='w') 
group=hdf.create_group(hdf.root,"group") 

particle={ 
    'A':pt.Float32Col(), 
    'B':pt.UInt16Col(), 
    'C':pt.UInt16Col(), 
    'time':pt.Float32Col(), 
    } 
dtypes=np.array([ 
    np.float32, 
    np.uint16, 
    np.uint16, 
    np.float32 
    ]) 

# This is the table to be stored in 
table=hdf.create_table(group,'trajectory', description=particle, expectedrows=var_dim) 

# My current way of storing 
for i, row in enumerate(array.T): 
    table.append([tuple([t(x) for t, x in zip(dtypes, row)])]) 
table.flush() 
hdf.close() 


hdf=pt.open_file(filename=filename,mode='r') 
array_table=hdf.root.group._f_iter_nodes().__next__() 

# My current way of reading 
row_list = [] 
for i, row in enumerate(array_table.read()): 
    row_list.append(np.array(list(row))) 

# The retrieved array
array=np.asarray(row_list).T 


# I've tried something with a recarray 
rec_array=array_table.read().view(type=np.recarray) 

# This gives me errors, or wrong results 
rec_array.view(dtype=np.float64) 
hdf.close()

The error I get:

Traceback (most recent call last): 
    File "/home/thomas/anaconda3/lib/python3.6/site-packages/numpy/core/records.py", line 475, in __setattr__ 
    ret = object.__setattr__(self, attr, val) 
ValueError: new type not compatible with array. 

During handling of the above exception, another exception occurred: 

Traceback (most recent call last): 
    File "/home/thomas/Documents/Thesis/SO.py", line 53, in <module> 
    rec_array.view(dtype=np.float64) 
    File "/home/thomas/anaconda3/lib/python3.6/site-packages/numpy/core/records.py", line 480, in __setattr__ 
    raise exctype(value) 
ValueError: new type not compatible with array. 
Closing remaining open files:test.hdf5...done

2017-04-26 Patrickens

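It might help to see the 'shape' and 'dtype' of array (from the first 'asarray'). I'm guessing it is already a structured array. Or similar information for the 'recarray' version. – hpaulj

Is using tables the only possible solution? How do you access your data afterwards (only whole 2D arrays, or subsets)? – max9111

Does your real data also have only a few columns (just four in your example)? Is your data compressible? Is even lossy compression an option for you? https://computation.llnl.gov/projects/floating-point-compression/zfp-compression-ratio-and-quality – max9111

As a quick and dirty solution, you can avoid the loops by temporarily converting the arrays to lists (if you can spare the memory). For some reason, record arrays are readily converted to and from lists, but not to and from conventional arrays.

Storing: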

table.append(array.T.tolist())

Loading:

loaded_array = np.array(array_table.read().tolist(), dtype=np.float64).T

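There should be a more "Numpythonic" way to convert between record arrays and conventional arrays, but I'm not familiar enough with the former to know how.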
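The list round trip can be tried without any HDF5 file, using only NumPy. The sketch below assumes the column layout from the question (A and time as float32, B and C as uint16); the names `records` and `restored` are only illustrative. According to the PyTables documentation, Table.append should also accept a structured array such as `records` directly, but that is not exercised here.

```python
import numpy as np

# Column layout taken from the question: A, B, C and time.
dt = np.dtype([('A', 'f4'), ('B', 'u2'), ('C', 'u2'), ('time', 'f4')])

rng = np.random.default_rng(0)
array = (rng.random((4, 100)) * 100).astype(np.float32)

# Plain 2D array -> structured array: each record must be a tuple,
# so the list of lists from tolist() is turned into a list of tuples.
records = np.array([tuple(row) for row in array.T.tolist()], dtype=dt)

# Structured array -> plain 2D float32 array, again via a list.
restored = np.array(records.tolist(), dtype=np.float32).T

print(records.dtype)
print(restored.shape, restored.dtype)
```

Note that the B and C rows come back truncated to whole numbers, because they were stored as uint16.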

2017-04-26 14:09:46 kazemakase

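This already makes the code much more readable! I'd still like to know what the Numpythonic way is. Thanks, though! – Patrickens

@Patrickens I just found [np.core.records.fromarrays](https://docs.scipy.org/doc/numpy/reference/generated/numpy.core.records.fromarrays.html), but I don't think it offers any benefit in this case. Internally it converts the array into a list of arrays, which is less efficient than '.tolist()', and it needs more arguments. Maybe my approach is not so Un-Numpythonic after all :) – kazemakase

In limited cases, 'view' or 'astype' can be used to convert a structured array to a numeric one, but this 'tolist' intermediary is the most general-purpose means. Note that going the other way requires converting the list of lists into a list of tuples. Another approach is to copy the fields by name. Since the number of records is usually much larger than the number of fields, you don't lose much by iterating. – hpaulj

I haven't worked with tables, but I have looked at its files with h5py. I'm guessing, then, that your array or recarray is a structured array with a dtype like: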

In [131]: dt=np.dtype('f4,u2,u2,f4') 
In [132]: arr = np.ones(3, dtype=dt) 
In [133]: arr 
Out[133]: 
array([(1., 1, 1, 1.), (1., 1, 1, 1.), (1., 1, 1, 1.)], 
     dtype=[('f0', '<f4'), ('f1', '<u2'), ('f2', '<u2'), ('f3', '<f4')])

Using @kazemakase's tolist approach (which I have recommended in other posts as well):

In [134]: np.array(arr.tolist(), float) 
Out[134]: 
array([[ 1., 1., 1., 1.], 
     [ 1., 1., 1., 1.], 
     [ 1., 1., 1., 1.]])

astype gets the shape all wrong:

In [135]: arr.astype(np.float32) 
Out[135]: array([ 1., 1., 1.], dtype=float32)

view works when the component dtypes are uniform, for example with the 2 float fields:

In [136]: arr[['f0','f3']].copy().view(np.float32) 
Out[136]: array([ 1., 1., 1., 1., 1., 1.], dtype=float32)

But it does need a reshape. view uses the data buffer bytes and just reinterprets them.
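A minimal sketch of the view-plus-reshape step, under one assumption about the environment: in newer NumPy releases (1.16 and later), indexing with a list of field names returns a view that keeps the original byte offsets, so a plain .copy() may not give a tightly packed buffer. The sketch therefore packs the two float fields with numpy.lib.recfunctions.repack_fields before viewing. The sample values are illustrative only.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

dt = np.dtype('f4,u2,u2,f4')
arr = np.ones(3, dtype=dt)
arr['f0'] = [1, 2, 3]
arr['f3'] = [4, 5, 6]

# Select the two float32 fields and pack them, so each record is
# exactly two adjacent float32 values with no padding in between.
floats = rfn.repack_fields(arr[['f0', 'f3']])

# view reinterprets the bytes as a flat float32 array ...
flat = floats.view(np.float32)

# ... so a reshape is needed to get one row per record.
two_d = flat.reshape(-1, 2)
print(two_d)
```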

Many of the recfunctions functions copy field by field. The equivalent here would be:

In [138]: res = np.empty((3,4),'float32') 
In [139]: for i in range(4): 
    ...:  res[:,i] = arr[arr.dtype.names[i]] 
    ...:  
In [140]: res 
Out[140]: 
array([[ 1., 1., 1., 1.], 
     [ 1., 1., 1., 1.], 
     [ 1., 1., 1., 1.]], dtype=float32)

If the number of fields is small compared to the number of records, this iteration is not expensive.

def foo(arr): 
    res = np.empty((arr.shape[0],4), np.float32) 
    for i in range(4): 
        res[:,i] = arr[arr.dtype.names[i]] 
    return res

With a large 4-field array, copying by field is clearly faster:

In [143]: arr = np.ones(10000, dtype=dt) 
In [149]: timeit x1 = foo(arr) 
10000 loops, best of 3: 73.5 µs per loop 
In [150]: timeit x2 = np.array(arr.tolist(), np.float32) 
100 loops, best of 3: 11.9 ms per loop
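For comparison, here is a sketch that generalizes the by-field copy to any number of fields and checks it against numpy.lib.recfunctions.structured_to_unstructured, which is available in NumPy 1.16 and later (an assumption about the installed version). The helper name `copy_by_field` is hypothetical, and no timings are claimed for it.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

dt = np.dtype('f4,u2,u2,f4')
arr = np.ones(10000, dtype=dt)

def copy_by_field(arr, dtype=np.float32):
    """Copy each field of a 1D structured array into one column."""
    names = arr.dtype.names
    res = np.empty((arr.shape[0], len(names)), dtype)
    for i, name in enumerate(names):
        res[:, i] = arr[name]
    return res

by_field = copy_by_field(arr)

# Library route: casts every field to the requested dtype and
# returns an array with one column per field.
by_library = rfn.structured_to_unstructured(arr, dtype=np.float32)

print(by_field.shape, by_library.shape)
```

Both results are plain float32 arrays with one row per record, so either can replace the tolist round trip when reading a table back.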

2017-04-26 19:58:14 hpaulj
