2014-04-22 74 views
15

I want to store a variable-length list of strings in an HDF5 dataset. The code for this is:

import h5py 
h5File = h5py.File('xxx.h5', 'w') 
strList = ['asas', 'asas', 'asas'] 
# raises the TypeError when strList holds Unicode (str) strings
h5File.create_dataset('xxx', (len(strList), 1), 'S10', strList) 
h5File.flush() 
h5File.close() 

I get an error stating "TypeError: No conversion path for dtype: dtype('<U3')".
How can I solve this problem?

+0

For starters, you have a typo on 'create_dataset'. Can you give the exact code you are using, in particular where 'strList' comes from? – SlightlyCuban

+0

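Sorry for the typo. I am trying to serialize a pandas DataFrame to an HDF5 file, so I have to create a header containing all the column names. I extract the column names into a list and try to write that list to an HDF5 dataset. – gman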

+0

Apart from the typo, the code above simulates exactly the same situation. – gman

Answers

14

You are reading in Unicode strings, but specifying your datatype as ASCII. According to the h5py wiki, h5py does not currently support this conversion.

You need to encode the strings in a format h5py can handle:

asciiList = [n.encode("ascii", "ignore") for n in strList] 
h5File.create_dataset('xxx', (len(asciiList),1),'S10', asciiList) 

Note: not everything that can be encoded in UTF-8 can be encoded in ASCII!
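To make the note above concrete, here is a minimal sketch in plain Python (no h5py needed) of what the "ignore" error handler does to non-ASCII input. The sample names are made up for illustration.

```python
# Hypothetical column names; two of them contain non-ASCII characters.
names = ["price", "café", "数量"]

# errors="ignore" silently drops every character ASCII cannot represent.
lossy = [n.encode("ascii", "ignore") for n in names]
print(lossy)  # [b'price', b'caf', b'']

# The default errors="strict" raises instead, which makes the data loss visible.
strict_failed = False
try:
    "café".encode("ascii")
except UnicodeEncodeError:
    strict_failed = True
print("strict encoding failed:", strict_failed)
```

If silently losing characters is not acceptable, leaving out the "ignore" argument at least turns the problem into an error you can see.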

+0

Thanks, it works perfectly. – gman

+0

What is the right way to re-extract these strings from the HDF5 file (in Python 3)? – DilithiumMatrix

+0

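@DilithiumMatrix ASCII is also valid UTF-8, so if you need the 'str' type you can use 'ascii.decode('utf-8')'. Note: my answer discards non-ASCII characters. If you saved them with 'encode('unicode_escape')' instead, then you need 'decode('unicode_escape')' to convert them back. – SlightlyCuban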
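A small sketch of the two round trips mentioned in the comment above, using only the standard library. The byte strings stand in for values read back from a fixed-length 'S' dataset, and the variable names are illustrative.

```python
# Case 1: plain ASCII bytes decode directly, because ASCII is a subset of UTF-8.
ascii_bytes = b"asas"
as_text = ascii_bytes.decode("utf-8")

# Case 2: escape non-ASCII characters instead of dropping them.
original = ["asas", "café"]
stored = [s.encode("unicode_escape") for s in original]      # pure ASCII bytes
restored = [b.decode("unicode_escape") for b in stored]      # back to str

print(as_text)   # asas
print(stored)    # [b'asas', b'caf\\xe9']
print(restored)  # ['asas', 'café']
```

Keep in mind that the escaped form can be longer than the original text, so a fixed width such as 'S10' may need to be larger.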

1

In HDF5, data in VL format is stored as arbitrary-length vectors of a base type. In particular, strings are stored C-style in null-terminated buffers. NumPy has no native mechanism to support this. Unfortunately, this is the de facto standard for representing strings in the HDF5 C API, and in many HDF5 applications.

Thankfully, NumPy has a generic pointer type in the form of the "object" ("O") dtype. In h5py, variable-length strings are mapped to object arrays. A small amount of metadata attached to an "O" dtype tells h5py that its contents should be converted to VL strings when stored in the file.

Existing VL strings can be read and written to with no additional effort; Python strings and fixed-length NumPy strings can be auto-converted to VL data and stored.

Example

In [27]: dt = h5py.special_dtype(vlen=str) 

In [28]: dset = h5File.create_dataset('vlen_str', (100,), dtype=dt) 

In [29]: dset[0] = 'the change of water into water vapour' 

In [30]: dset[0] 
Out[30]: 'the change of water into water vapour' 
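For reference, a self-contained version of the session above might look like the following sketch. It assumes h5py is installed. The file name, dataset name, and sample strings are made up, and the in-memory "core" driver is used only so that nothing is written to disk. Depending on the h5py version, values may come back as str or as bytes, so the sketch normalises them.

```python
import h5py

# Illustrative data: mixed lengths, one non-ASCII string.
str_list = ["asas", "héllo", "a string much longer than ten characters"]

# driver="core" with backing_store=False keeps the file purely in memory.
with h5py.File("vlen_demo.h5", "w", driver="core", backing_store=False) as f:
    dt = h5py.special_dtype(vlen=str)  # variable-length string dtype
    dset = f.create_dataset("names", (len(str_list),), dtype=dt)
    for i, s in enumerate(str_list):
        dset[i] = s

    # Older h5py releases return str here, newer ones return bytes.
    read_back = [v.decode("utf-8") if isinstance(v, bytes) else v
                 for v in dset[:]]

print(read_back)
```

With this approach no fixed width such as 'S10' has to be chosen, and non-ASCII text does not need to be stripped first.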
3

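I was in a similar situation, wanting to store a DataFrame's column names as a dataset in an HDF5 file. Assuming df.columns is what I want to store, I found that the following works: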

h5File = h5py.File('my_file.h5','w') 
h5File['col_names'] = df.columns.values.astype('S') 

This assumes that the column names are "simple" strings that can be encoded in ASCII.
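A rough sketch of what that conversion does, using NumPy only and a hand-built object array as a stand-in for df.columns.values (the column names are invented):

```python
import numpy as np

# Stand-in for df.columns.values: an object array of Python str.
col_names = np.array(["open", "close", "volume"], dtype=object)

# astype('S') produces fixed-length byte strings sized to the longest entry.
encoded = col_names.astype("S")
print(encoded.dtype)  # |S6

# Reading back: convert the byte strings to Python str again.
decoded = [b.decode("ascii") for b in encoded]
print(decoded)  # ['open', 'close', 'volume']
```

The resulting array is the kind of fixed-length ASCII data that h5py stores directly, which is why the assignment above works without an explicit dtype.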
