2014-11-24 21 views
2

我想要在HDF5文件中包含的大矩陣內指定特定的行。我有一個.txt文件,其中包含存在於HDF5文件中的感興趣的ID,並希望輸出這些行的相應數據 - 所有相應的數據都是數字。從HDF5訪問數據 - 切片/提取數據

我寫了下面的代碼,但輸出只包含id(單列)。我需要附加到這些行的剩餘數據(後續列中的數據)。任何意見將不勝感激!

import os 
import h5py 


mydir = os.path.expanduser("~/Desktop/alexs-stuff/") 

in_file = mydir + "EMP/EMPopen/full_emp_table_hdf5.h5" 
wanted_file = mydir + "EMP/greengenes-curto-only.txt" 
out_file = mydir + "EMP/emp-curto-only.txt" 

wanted = set() 

with open(wanted_file) as f: 
    for line in f: 
     line = line.strip() 
     if line != "": 
      wanted.add(line) 

hdf5_file = h5py.File(in_file, "r") 

count = 0 

with open(out_file, "w") as h: 
    for keys in hdf5_file["observation"]["ids"]: 
     if keys in wanted: 
      count = count + 1 
      h.write(keys + "\n") 

print "Converted %i records" % count 

hdf5_file.close() 

如果有幫助,這裏是HDF5文件的結構:

<HDF5 file "full_emp_table_hdf5.h5" (mode r)> (File)/
sample  /sample (Group) /sample 
    metadata  /sample/metadata (Group) /sample/metadata 
    group-metadata  /sample/group-metadata (Group) /sample/group-metadata 
    ids  /sample/ids (Dataset) /sample/ids  len = (15481,) object 
    matrix  /sample/matrix (Group) /sample/matrix 
     indices  /sample/matrix/indices (Dataset) /sample/matrix/indices  len = (107439386,) int32 
     indptr  /sample/matrix/indptr (Dataset) /sample/matrix/indptr  len = (15482,) int32 
     data  /sample/matrix/data (Dataset) /sample/matrix/data  len = (107439386,) float64 
observation  /observation (Group) /observation 
    metadata  /observation/metadata (Group) /observation/metadata 
     taxonomy  /observation/metadata/taxonomy (Dataset) /observation/metadata/taxonomy  len = (5594412, 7) object 
    group-metadata  /observation/group-metadata (Group) /observation/group-metadata 
    ids  /observation/ids (Dataset) /observation/ids  len = (5594412,) object 
    matrix  /observation/matrix (Group) /observation/matrix 
     indices  /observation/matrix/indices (Dataset) /observation/matrix/indices  len = (107439386,) int32 
     indptr  /observation/matrix/indptr (Dataset) /observation/matrix/indptr  len = (5594413,) int32 
     data  /observation/matrix/data (Dataset) /observation/matrix/data  len = (107439386,) float64 

附加信息:

type(hdf5_file['observation']['ids']) 
>>> <class 'h5py._hl.dataset.Dataset'> 


dir(hdf5_file['observation']['ids']) 
>>> ['__array__', '__class__', '__delattr__', '__dict__', '__doc__', '__eq__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_d', '_dcpl', '_e', '_filters', '_id', '_lapl', '_lcpl', '_local', 'astype', 'attrs', 'chunks', 'compression', 'compression_opts', 'dims', 'dtype', 'file', 'fillvalue', 'fletcher32', 'id', 'len', 'maxshape', 'name', 'parent', 'read_direct', 'ref', 'regionref', 'resize', 'scaleoffset', 'shape', 'shuffle', 'size', 'value', 'write_direct'] 
+0

看起來這個數據來源於'biom-format'文件。 http://biom-format.org/你有沒有安裝'pypi'軟件包? – hpaulj 2015-01-01 22:30:28

回答

0

更改行:

h.write(keys + "\n") 

本:

h.write(hdf5_file['observation']['ids'][keys] + "\n") 
+0

嘿!感謝您的反饋。我收到以下錯誤: ValueError:只允許複合類型的字段名稱 是否有任何想法?對不起hdf5文件的新功能。 – 2014-11-25 20:36:21

+0

是什麼類型的'輸出(hdf5_file [ '觀察'] [ 'IDS']'嗎?還有,你能傾倒的'目錄(hdf5_file [ '觀察'] [ 'IDS']輸出'? – vikramls 2014-11-25 22:38:17

+0

添加在其上方再次感謝。 – 2014-11-26 02:15:00

1

我覺得這hdf5文件是由biom-format軟件,http://biom-format.org,生產從它的pypi包:https://pypi.python.org/packages/source/b/biom-format/biom-format-2.1.2.tar.gz

我可以提取樣本hdf5文件:examples/min_sparse_otu_table_hdf5.biom,並處理它:

import numpy as np 
from scipy import sparse as sp 
import h5py 

f = h5py.File('min_sparse_otu_table_hdf5.biom','r') 
f.visit(print) # to see its structure 
""" 
observation 
observation/group-metadata 
observation/ids 
observation/matrix 
observation/matrix/data 
observation/matrix/indices 
observation/matrix/indptr 
observation/metadata 
sample 
sample/group-metadata 
sample/ids 
sample/matrix 
sample/matrix/data 
sample/matrix/indices 
sample/matrix/indptr 
sample/metadata 
""" 

或更詳細地說:

f.visititems(lambda name,obj:print(name, obj)) 
""" 
... 
sample <HDF5 group "/sample" (4 members)> 
sample/group-metadata <HDF5 group "/sample/group-metadata" (0 members)> 
sample/ids <HDF5 dataset "ids": shape (6,), type "|O4"> 
sample/matrix <HDF5 group "/sample/matrix" (3 members)> 
sample/matrix/data <HDF5 dataset "data": shape (15,), type "<f8"> 
sample/matrix/indices <HDF5 dataset "indices": shape (15,), type "<i4"> 
sample/matrix/indptr <HDF5 dataset "indptr": shape (7,), type "<i4"> 
sample/metadata <HDF5 group "/sample/metadata" (0 members)> 
""" 

這也有類似的結構,你full_emp_table_hdf5.h5

負荷陣列,.value(或[:]

sample_ids = f['sample/ids'].value 
""" 
array([b'Sample1', b'Sample2', b'Sample3', b'Sample4', b'Sample5', 
     b'Sample6'], dtype=object) 
""" 

# create a sparse array from the sample/matrix group 
data = f['sample/matrix/data'].value 
indices = f['sample/matrix/indices'].value 
indptr = f['sample/matrix/indptr'].value 
sample = sp.csr_matrix((data,indices,indptr)) 
# display as a dense array 
print(sample.A) 

""" 
array([[ 0., 5., 0., 2., 0.], 
     [ 0., 1., 0., 1., 1.], 
     [ 1., 0., 1., 1., 1.], 
     [ 0., 2., 4., 0., 0.], 
     [ 0., 3., 0., 0., 0.], 
     [ 0., 1., 2., 1., 0.]]) 
""" 

# select a specific row 
print(sample[2,:]) 
""" 
    (0, 0) 1.0 
    (0, 2) 1.0 
    (0, 3) 1.0 
    (0, 4) 1.0 
""" 
print(sample[2,:].A) # that row in dense format 
# array([[ 1., 0., 1., 1., 1.]]) 

我可以寫sample到一個文本文件:

np.savetxt('min_sample.txt',sample.A, '%.2f') 

""" 
0.00 5.00 0.00 2.00 0.00 
0.00 1.00 0.00 1.00 1.00 
1.00 0.00 1.00 1.00 1.00 
0.00 2.00 4.00 0.00 0.00 
0.00 3.00 0.00 0.00 0.00 
0.00 1.00 2.00 1.00 0.00 
""" 

biom模塊可能可以做到所有這些和更多,但基本只需要numpysparse模塊。

+0

謝謝@hpaulj!我放棄了嘗試python的方法,並想出瞭如何提取MatLab上的數據,我將回過頭來嘗試使用你所建議的模塊,而且,你是對的,hdf5來源於一個biom文件。但是,biom文件太大而無法加載到內存中(2.6 GB),所以我試圖使用hdf5來代替。再次感謝您的幫助! – 2015-01-05 21:19:30

0

使用python我沒有找到一個簡單的答案,但如果其他人遇到類似的問題,使用:

$biom subset-table -i input_file.biom -a observation -s ids_wanted.txt -o biom 

然後可以轉換這樣的:

$biom convert -i outputfilefromabove.biom -o outputfilefromabove.txt --to_tsv 

這種方法是這樣更容易和利用在biom偉大的資源!