Loading Matlab sparse matrices in Python using PyTables

I originally asked a related question here, but it doesn't seem to have got anywhere. Perhaps rephrasing part of it more specifically will help.

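I have files stored in Matlab's sparse format (HDF5, CSC I believe), and I'm trying to use PyTables to operate on them directly, but have not succeeded so far. Using h5py I can do the following: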

# Method 1: uses h5py (WORKS)
import h5py
import scipy.sparse

f1 = h5py.File(fname, 'r')  # open read-only
data = f1['M']['data']
ir = f1['M']['ir']
jc = f1['M']['jc']
M = scipy.sparse.csc_matrix((data, ir, jc))

But if I try the PyTables equivalent:

# Method 2: uses pyTables (DOESN'T WORK) 
f2 = tables.openFile(fname) 
data = f2.root.M.data 
ir = f2.root.M.ir 
jc = f2.root.M.jc 
M = scipy.sparse.csc_matrix((data,ir,jc))

it fails (after a long wait) with the error:

TypeError         Traceback (most recent call last) 

/home/tdiethe/BMJ/<ipython console> in <module>() 

/usr/lib/python2.6/dist-packages/scipy/sparse/compressed.pyc in __init__(self, arg1, shape, dtype, copy, dims, nzmax) 
    56      self.indices = np.array(indices, copy=copy) 
    57      self.indptr = np.array(indptr, copy=copy) 
---> 58      self.data = np.array(data, copy=copy, dtype=getdtype(dtype, data)) 
    59     else: 
    60      raise ValueError, "unrecognized %s_matrix constructor usage" %\ 

/usr/lib/python2.6/dist-packages/scipy/sparse/sputils.pyc in getdtype(dtype, a, default) 
    69     canCast = False 
    70    else: 
---> 71     raise TypeError, "could not interpret data type" 
    72  else: 
    73   newdtype = np.dtype(dtype) 

TypeError: could not interpret data type

Looking at f2:

In [63]: f2.root.M.data 
Out[63]: 
/M/data (CArray(4753606,), zlib(3)) '' 
    atom := Float64Atom(shape=(), dflt=0.0) 
    maindim := 0 
    flavor := 'numpy' 
    byteorder := 'little' 
    chunkshape := (8181,) 

In [64]: f2.root.M.ir 
Out[64]: 
/M/ir (CArray(4753606,), zlib(3)) '' 
    atom := UInt64Atom(shape=(), dflt=0) 
    maindim := 0 
    flavor := 'numpy' 
    byteorder := 'little' 
    chunkshape := (8181,) 

In [65]: f2.root.M.jc 
Out[65]: 
/M/jc (CArray(133339,), zlib(3)) '' 
    atom := UInt64Atom(shape=(), dflt=0) 
    maindim := 0 
    flavor := 'numpy' 
    byteorder := 'little' 
    chunkshape := (7843,)

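I have two questions:

1. How do I load this file using PyTables?
2. Do I need to convert to a SciPy sparse matrix in order to perform operations on it, or can I perform operations (matrix multiplication etc.) directly on the on-disk file, i.e. without loading the file into memory? (And if not, what is the point of using PyTables?)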

Asked 2011-12-09 by tdc

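These errors are coming out of SciPy. Can you check whether you can operate on the data in 'data', 'ir' or 'jc'? What does numpy have to say about the data (i.e. dtype, shape, etc.)? Are the results what you expect? Do they match what scipy.sparse.csc_matrix expects for that call signature? – dtlussier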

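Ah yes, it seems all I had to do was: M = sparse.csc_matrix((f2.root.M.data[...], f2.root.M.ir[...], f2.root.M.jc[...])). Still not sure about the second question, though. PyTables seems to have only element-wise operations available? – tdc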
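On the second question, one possible way to avoid holding the whole matrix in memory is to work through it one block of columns at a time, since CSC stores the entries of each column contiguously. The sketch below is not a PyTables feature. The function name chunked_csc_matvec and its parameters are made up for illustration, and plain NumPy arrays stand in for the on-disk nodes, on the assumption that those nodes can be sliced in the same way as f2.root.M.data[...].

```python
import numpy as np
import scipy.sparse


def chunked_csc_matvec(data, ir, jc, x, n_rows, cols_per_block=2):
    """Compute y = M.dot(x) for a CSC matrix, one block of columns at a time.

    data, ir and jc may be any sliceable 1-D objects. NumPy arrays are used
    here; on-disk nodes are assumed (not verified) to slice the same way.
    """
    n_cols = len(jc) - 1
    y = np.zeros(n_rows, dtype=np.float64)
    for start in range(0, n_cols, cols_per_block):
        stop = min(start + cols_per_block, n_cols)
        # Column pointers for this block only (stop - start + 1 entries).
        ptr = np.asarray(jc[start:stop + 1], dtype=np.int64)
        lo, hi = int(ptr[0]), int(ptr[-1])
        # Only the non-zeros belonging to these columns are read.
        block = scipy.sparse.csc_matrix(
            (
                np.asarray(data[lo:hi], dtype=np.float64),
                np.asarray(ir[lo:hi], dtype=np.int64),
                ptr - lo,  # re-base the pointers so the block starts at 0
            ),
            shape=(n_rows, stop - start),
        )
        y += block.dot(x[start:stop])
    return y


# Hypothetical 3x3 example: [[1, 0, 2], [0, 0, 3], [4, 5, 6]] in CSC form.
data = np.array([1.0, 4.0, 5.0, 2.0, 3.0, 6.0])
ir = np.array([0, 2, 2, 0, 1, 2], dtype=np.uint64)
jc = np.array([0, 2, 3, 6], dtype=np.uint64)
x = np.array([1.0, 2.0, 3.0])

y = chunked_csc_matvec(data, ir, jc, x, n_rows=3)
```

Each pass builds a small in-memory csc_matrix from one slice, so peak memory depends on cols_per_block rather than on the size of the whole file. The number of rows has to be supplied separately because it cannot be recovered from data, ir and jc alone.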

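I missed this in your original post, but I think your problem lies in the design of PyTables, which provides an extra level of abstraction on top of the underlying data.

Consider the following: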

>>> import tables
>>> import numpy as np

>>> h5_file = tables.openFile(fname)
>>> data = h5_file.root.M.data

At this point data is not a numpy array:

>>> type(data) 
tables.array.Array 

>>> isinstance(data, np.ndarray) 
False

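A tables.array.Array does not immediately load the underlying array, nor does it immediately expose array-like functionality. That is what causes the error when you try to use these objects to create a sparse matrix in scipy.

Instead, the data object produced by PyTables is meant to give access to the data through additional commands (as you did by using the fancy indexing [...]). With this approach you can read part or all of the data by calling data[:] or data.read(). Only at that point is the familiar numpy array generated.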
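The access pattern described above can be sketched end to end as follows. This is a minimal illustration rather than the asker's actual setup: the file demo.h5 and the small 3x3 matrix are invented, and it uses the open_file and create_array spellings from PyTables 3.x, whereas the session above uses the older openFile name.

```python
import os
import tempfile

import numpy as np
import scipy.sparse
import tables

# Hypothetical 3x3 matrix [[1, 0, 2], [0, 0, 3], [4, 5, 6]] in CSC form,
# with unsigned 64-bit index arrays as in the listing of f2 above.
data = np.array([1.0, 4.0, 5.0, 2.0, 3.0, 6.0])
ir = np.array([0, 2, 2, 0, 1, 2], dtype=np.uint64)
jc = np.array([0, 2, 3, 6], dtype=np.uint64)

# Write a stand-in file with the same /M/data, /M/ir, /M/jc layout.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with tables.open_file(path, mode="w") as h5:
    group = h5.create_group("/", "M")
    for name, arr in (("data", data), ("ir", ir), ("jc", jc)):
        h5.create_array(group, name, arr)

with tables.open_file(path, mode="r") as h5:
    node = h5.root.M.data
    node_is_ndarray = isinstance(node, np.ndarray)  # the node is only a handle

    # read() and [:] both pull the values into memory as NumPy arrays.
    data_np = h5.root.M.data.read()
    ir_np = h5.root.M.ir[:].astype(np.int64)  # signed indices for scipy
    jc_np = h5.root.M.jc[:].astype(np.int64)

# The arrays are ordinary ndarrays now, so the file can already be closed.
M = scipy.sparse.csc_matrix((data_np, ir_np, jc_np), shape=(3, 3))
```

The shape is passed explicitly here. Without it, scipy infers the number of rows from the largest row index, which undercounts if the trailing rows of the matrix are empty.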

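For more information on the tables.array.Array class, see http://pytables.github.com/usersguide/libref.html#the-array-class, or the "Getting actual data" section of http://www.pytables.org/moin/HowToUse for examples of accessing the underlying data.

By comparison, h5py produces a more array-like object, though it is still not a numpy array. Consider: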

>>> import h5py
>>> f1 = h5py.File(fname)
>>> data = f1['M']['data']
>>> type(data)
h5py._hl.dataset.Dataset
>>> isinstance(data, np.ndarray)
False

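However, you can apply numpy operations to data straight away, such as your call to scipy, or simpler operations like np.cos(data) or data + np.arange(len(data)). The data object also appears to have some array-like attributes (e.g. shape), and the underlying data (a numpy.ndarray) is stored in data.value. That said, I am not familiar with h5py, as I haven't used it myself, so I'm not sure what its limitations are in this respect.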
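A small sketch of the h5py behaviour described above, using an invented file demo_h5py.h5 rather than the asker's data. It reads the values with slicing instead of data.value, because as far as I know the value attribute is no longer available in recent h5py releases.

```python
import os
import tempfile

import h5py
import numpy as np

# Write a tiny stand-in dataset at /M/data (intermediate group created automatically).
path = os.path.join(tempfile.mkdtemp(), "demo_h5py.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("M/data", data=np.array([1.0, 4.0, 5.0]))

with h5py.File(path, "r") as f:
    dset = f["M"]["data"]
    dset_is_ndarray = isinstance(dset, np.ndarray)  # a Dataset, not an ndarray
    dset_shape = dset.shape  # array-like attribute, available without reading

    # Slicing with [...] reads the whole dataset into a NumPy array.
    arr = dset[...]
    cosines = np.cos(arr)
```

Slicing makes the point at which data leaves the disk explicit, which matters once the datasets are as large as the ones in the question.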

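In general, PyTables and h5py seem to have different design goals, and so should be used in different ways. h5py provides a more NumPy-like interface to HDF files, whereas PyTables provides more sophisticated database-style operations. The differences are discussed in the h5py documentation, the PyTables documentation and on the Enthought mailing list.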

Answered 2011-12-13 16:05:35 by dtlussier

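Very useful answer, thanks. Slowly getting the hang of this! I think PyTables will provide some useful functionality for our project, since scalability is very important. – tdc

Great, glad to help. – dtlussier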
