基於2D條件對大型3D HDF5數據集進行子集化索引

我有一個大的3D HDF5數據集，它表示某個變量的位置（X，Y）和時間。接下來，我有一個2D numpy數組，其中包含相同（X，Y）位置的分類。我想要實現的是，我可以從3D HDF5數據集中提取屬於2D數組中某個類的所有時間序列。基於2D條件對大型3D HDF5數據集進行子集化索引

這裏是我的例子：

import numpy as np 
import h5py 

# Open the HDF5 dataset 
NDVI_file = 'NDVI_values.hdf5' 
f_NDVI = h5py.File(NDVI_file,'r') 
NDVI_data = f_NDVI["NDVI"] 

# See what's in the dataset 
NDVI_data 
<HDF5 dataset "NDVI": shape (1319, 2063, 53), type "<f4"> 

# Let's make a random 1319 x 2063 classification containing class numbers 0-4 
classification = np.random.randint(5, size=(1319, 2063))

現在，我們有我們的3D數據集HDF5和2D分類。讓我們來看看該課下數下降像素「3」

# Look for the X,Y locations that have class number '3' 
idx = np.where(classification == 3)

這將返回我大小2元組包含X，符合條件y對，在我隨便舉個例子對的量是544433 。現在該如何使用idx變量創建一個包含分類類號爲「3」的像素的544433時間序列的二維大小數組（544433,53）？

我做了一些測試用花哨的索引和純3D numpy的陣列和這個例子會工作得很好：

subset = 3D_numpy_array[idx[0],idx[1],:]

然而，HDF5的數據集過大轉換爲numpy的陣列;當我試圖直接在HDF5數據集使用相同的索引方法：

# Try to use fancy indexing directly on HDF5 dataset 
NDVI_subset = np.array(NDVI_data[idx[0],idx[1],:])

這引發了我的錯誤：

Traceback (most recent call last): 
File "<stdin>", line 1, in <module> 
File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper  (C:\aroot\work\h5py\_objects.c:2584) 
File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (C:\aroot\work\h5py\_objects.c:2543) 
File "C:\Users\vtrichtk\AppData\Local\Continuum\Anaconda2\lib\site-packages\h5py\_hl\dataset.py", line 431, in __getitem__ 
selection = sel.select(self.shape, args, dsid=self.id) 
File "C:\Users\vtrichtk\AppData\Local\Continuum\Anaconda2\lib\site-packages\h5py\_hl\selections.py", line 95, in select 
sel[args] 
File "C:\Users\vtrichtk\AppData\Local\Continuum\Anaconda2\lib\site-packages\h5py\_hl\selections.py", line 429, in __getitem__ 
raise TypeError("Indexing elements must be in increasing order") 
TypeError: Indexing elements must be in increasing order

我想另一件事是np.repeat在第三分類數組用於創建與HDF5數據集形狀相匹配的3D數組。比idx變量換成大小3元組：

classification_3D = np.repeat(np.reshape(classification,(1319,2063,1)),53,axis=2) 
idx = np.where(classification == 3)

但除了拋出完全相同的錯誤下面的語句：

NDVI_subset = np.array(NDVI_data[idx])

這是因爲HDF5數據集的工作方式不同與純numpy的陣列？該文件確實說：「選擇座標必須按遞增順序給出」

有沒有人在這種情況下有一個建議，我怎麼可以得到這個工作，而不必讀取完整的HDF5數據集到內存中（這是行不通的）？非常感謝！

來源

2016-08-04 Kristof

什麼'h5py' DOC說的晚期或花哨的索引？我會研究這個，然後建立一個更小的測試用例，我可以在移動兩個3d之前測試二維數組上的這種索引。我可以在哪裏打印所有的值。 H5在編制索引時的確有可能受到限制。 – hpaulj

http://docs.h5py.org/zh/latest/high/dataset.html#fancy-indexing – hpaulj

h5py中的高級/花式索引幾乎沒有np.ndarray一般。

設置一個小的測試情況：

In [202]: ind 
Out[202]: 
(array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], dtype=int32), 
array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2], dtype=int32), 
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)) 

In [203]: x[ind] 
Out[203]: array([ 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29]) 
In [204]: dset[ind] 
... 
TypeError: Indexing elements must be in increasing order

我可以索引上與像的列表的單個維度：dset[[1,2,3],...]，但重複指數

import h5py 
f=h5py.File('test.h5','w') 
dset=f.create_dataset('data',(5,3,2),dtype='i') 
dset[...]=np.arange(5*3*2).reshape(5,3,2) 
x=np.arange(5*3*2).reshape(5,3,2) 

ind=np.where(x%2)

我可以選擇所有奇數值值或更改順序會產生錯誤，dset[[1,1,2,2],...]或dset[[2,1,0],...]。 dset[:,[0,1],:]是好的。

幾片很好，dset[0:3,1:3,:]，或片和列表，dset[0:3,[1,2],:]。

但2列出dset[[0,1,2],[1,2],:]產生

TypeError: Only one indexing vector or array is currently allowed for advanced selection

所以對於np.where索引元組是錯在幾個方面。

我不知道這是多少是h5存儲的限制，以及h5py模塊中的不完整開發有多少。也許有點兩者。

因此，您需要從文件中加載更簡單的塊，然後對生成的numpy數組執行fancier索引。

在我odd values情況下，我只需要做：

In [225]: dset[:,:,1] 
Out[225]: 
array([[ 1, 3, 5], 
     [ 7, 9, 11], 
     [13, 15, 17], 
     [19, 21, 23], 
     [25, 27, 29]])

來源

2016-08-04 18:47:03 hpaulj

感謝您分享該示例，它很好地闡明瞭h5py索引的侷限性。我看到你的解決方案爲「奇數值」情況，但我擔心在我的情況下沒有這樣簡單的解決方案。我不得不爲每個時間步分開2D陣列，然後在Numpy陣列上做一些奇特的索引，但這看起來不是一個快速和優雅的解決方案。 – Kristof

基於2D條件對大型3D HDF5數據集進行子集化索引

回答

相關問題