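Accessing data from HDF5 - slicing/extracting data
I want to select specific rows from a large matrix stored in an HDF5 file. I have a .txt file listing the IDs of interest, all of which exist in the HDF5 file, and I want to output the data belonging to those rows. All of that data is numeric.
I wrote the code below, but the output contains only the IDs (a single column). I also need the rest of the data attached to those rows (the values in the subsequent columns). Any advice would be much appreciated!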
import os
import h5py

mydir = os.path.expanduser("~/Desktop/alexs-stuff/")
in_file = mydir + "EMP/EMPopen/full_emp_table_hdf5.h5"
wanted_file = mydir + "EMP/greengenes-curto-only.txt"
out_file = mydir + "EMP/emp-curto-only.txt"

# Read the IDs of interest, one per line, skipping blank lines.
wanted = set()
with open(wanted_file) as f:
    for line in f:
        line = line.strip()
        if line != "":
            wanted.add(line)

hdf5_file = h5py.File(in_file, "r")

# Write out every observation ID that appears in the wanted set.
# Only the ID itself is written here, not the matrix values for that row.
count = 0
with open(out_file, "w") as h:
    for keys in hdf5_file["observation"]["ids"]:
        if keys in wanted:
            count = count + 1
            h.write(keys + "\n")

print("Converted %i records" % count)
hdf5_file.close()
In case it helps, here is the structure of the HDF5 file:
<HDF5 file "full_emp_table_hdf5.h5" (mode r)> (File)/
    sample /sample (Group) /sample
        metadata /sample/metadata (Group) /sample/metadata
        group-metadata /sample/group-metadata (Group) /sample/group-metadata
        ids /sample/ids (Dataset) /sample/ids len = (15481,) object
        matrix /sample/matrix (Group) /sample/matrix
            indices /sample/matrix/indices (Dataset) /sample/matrix/indices len = (107439386,) int32
            indptr /sample/matrix/indptr (Dataset) /sample/matrix/indptr len = (15482,) int32
            data /sample/matrix/data (Dataset) /sample/matrix/data len = (107439386,) float64
    observation /observation (Group) /observation
        metadata /observation/metadata (Group) /observation/metadata
            taxonomy /observation/metadata/taxonomy (Dataset) /observation/metadata/taxonomy len = (5594412, 7) object
        group-metadata /observation/group-metadata (Group) /observation/group-metadata
        ids /observation/ids (Dataset) /observation/ids len = (5594412,) object
        matrix /observation/matrix (Group) /observation/matrix
            indices /observation/matrix/indices (Dataset) /observation/matrix/indices len = (107439386,) int32
            indptr /observation/matrix/indptr (Dataset) /observation/matrix/indptr len = (5594413,) int32
            data /observation/matrix/data (Dataset) /observation/matrix/data len = (107439386,) float64
Additional information:
type(hdf5_file['observation']['ids'])
>>> <class 'h5py._hl.dataset.Dataset'>
dir(hdf5_file['observation']['ids'])
>>> ['__array__', '__class__', '__delattr__', '__dict__', '__doc__', '__eq__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_d', '_dcpl', '_e', '_filters', '_id', '_lapl', '_lcpl', '_local', 'astype', 'attrs', 'chunks', 'compression', 'compression_opts', 'dims', 'dtype', 'file', 'fillvalue', 'fletcher32', 'id', 'len', 'maxshape', 'name', 'parent', 'read_direct', 'ref', 'regionref', 'resize', 'scaleoffset', 'shape', 'shuffle', 'size', 'value', 'write_direct']
It looks like this data comes from a 'biom-format' file. http://biom-format.org/ Have you installed its 'pypi' package? – hpaulj 2015-01-01 22:30:28
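For what it is worth, the `indptr`/`indices`/`data` triple under `/observation/matrix` in the question looks like a compressed sparse row (CSR) layout: `indptr` has one more entry than `/observation/ids`, so `indptr[i]:indptr[i + 1]` would delimit the non-zero values of observation `i`, and `indices` would hold the matching sample (column) positions. That reading is an assumption based on the dataset lengths, not something the question confirms. Under that assumption, a minimal Python 3 sketch of pulling the values for the wanted IDs might look like the following. The helper names `dense_row` and `write_wanted_rows` and the toy data are hypothetical, chosen only for illustration.

```python
import io


def dense_row(indptr, indices, data, row, n_cols):
    """Expand one CSR row into a dense list of length n_cols."""
    # Assumption: indptr[row]:indptr[row + 1] bounds the non-zeros of this row.
    start, end = int(indptr[row]), int(indptr[row + 1])
    dense = [0.0] * n_cols
    for col, val in zip(indices[start:end], data[start:end]):
        dense[int(col)] = float(val)
    return dense


def write_wanted_rows(ids, indptr, indices, data, n_cols, wanted, out):
    """Write 'id<TAB>v0<TAB>v1...' for each id in wanted; return the count."""
    count = 0
    for row, obs_id in enumerate(ids):
        # IDs read from HDF5 may come back as bytes; compare as text.
        if isinstance(obs_id, bytes):
            obs_id = obs_id.decode("utf-8")
        if obs_id in wanted:
            values = dense_row(indptr, indices, data, row, n_cols)
            out.write(obs_id + "\t" + "\t".join("%g" % v for v in values) + "\n")
            count += 1
    return count


# Toy data standing in for the HDF5 datasets (3 observations x 3 samples).
toy_ids = ["a", "b", "c"]
toy_indptr = [0, 2, 2, 3]
toy_indices = [0, 2, 1]
toy_data = [1.0, 2.5, 4.0]

buf = io.StringIO()
n_written = write_wanted_rows(
    toy_ids, toy_indptr, toy_indices, toy_data, 3, {"a", "c"}, buf
)
```

With the real file, the same functions could presumably be fed `hdf5_file["observation/ids"][:]` and `hdf5_file["observation/matrix/indptr"][:]` (read into memory once, since iterating an h5py dataset element by element tends to be slow), the `indices` and `data` datasets from the same group, and `n_cols = len(hdf5_file["sample/ids"])`. Each output line would then carry 15481 values, most of them zero, so writing only the non-zero `(column, value)` pairs may be the more practical format.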