2017-07-03 52 views

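Reading specific columns from an HDF5 file and passing a condition

I want to read only specific columns from an HDF5 file and apply a condition on those columns. My concern is that I don't want to load the whole HDF5 file into memory as a DataFrame. I only want the columns I need, filtered by the condition.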

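import os
import pandas as pd

columns = ['col1', 'col2']
condition = "col2 == 1"
groupname = '/path/to/group'  # forward slashes: '\t' inside a plain string is a tab
Hdf5File = os.path.join('path', 'to', 'hdf5.h5')
with pd.HDFStore(Hdf5File, mode='r') as store:
    if groupname in store:
        # columns= and where= are only accepted for table-format stores
        df = pd.read_hdf(store, key=groupname, columns=columns, where=condition)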

I get an error:

TypeError: cannot pass a column specification when reading a Fixed format store. this store must be selected in its entirety

Then I used the following line, which returns only the specified columns:

df=store[groupname][columns] 

But I don't know how to pass the condition to it.


Possible duplicate of [Python pandas: reading specific values from HDF5 files using read_hdf and HDFStore.select](https://stackoverflow.com/questions/26302480/python-pandas-reading-specific-values-from-hdf5-files-using-read-hdf-and-hdfstor)

Answer


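To read an HDF5 file conditionally, it must have been saved in table format, and the columns used in the condition must be indexed (written as data columns).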

Demo:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(100, 5), columns=list('abcde'))
df.to_hdf('c:/temp/file.h5', key='df_key', format='t', data_columns=True)

In [10]: pd.read_hdf('c:/temp/file.h5', 'df_key', where="a > 0.5 and a < 0.75") 
Out[10]: 
      a   b   c   d   e 
3 0.744123 0.515697 0.005335 0.017147 0.176254 
5 0.555202 0.074128 0.874943 0.660555 0.776340 
6 0.667145 0.278355 0.661728 0.705750 0.623682 
8 0.701163 0.429860 0.223079 0.735633 0.476182 
14 0.645130 0.302878 0.428298 0.969632 0.983690 
15 0.633334 0.898632 0.881866 0.228983 0.216519 
16 0.535633 0.906661 0.221823 0.608291 0.330101 
17 0.715708 0.478515 0.002676 0.231314 0.075967 
18 0.587762 0.262281 0.458854 0.811845 0.921100 
21 0.551251 0.537855 0.906546 0.169346 0.063612 
..  ...  ...  ...  ...  ... 
68 0.610958 0.874373 0.785681 0.147954 0.966443 
72 0.619666 0.818202 0.378740 0.416452 0.903129 
73 0.500782 0.536064 0.697678 0.654602 0.054445 
74 0.638659 0.518900 0.210444 0.308874 0.604929 
76 0.696883 0.601130 0.402640 0.150834 0.264218 
77 0.692149 0.963457 0.364050 0.152215 0.622544 
85 0.737854 0.055863 0.346940 0.003907 0.678405 
91 0.644924 0.840488 0.151190 0.566749 0.181861 
93 0.710590 0.900474 0.061603 0.144200 0.946062 
95 0.601144 0.288909 0.074561 0.615098 0.737097 

[33 rows x 5 columns] 
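
The demo above filters rows only. Since the question also asks for a subset of columns, here is a minimal sketch that combines columns= with where= on a table-format store. The temporary file path and the choice to declare only 'a' as a data column are assumptions made for illustration; they are not taken from the question.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical scratch file; the path and key are placeholders for illustration.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")

df = pd.DataFrame(np.random.rand(100, 5), columns=list("abcde"))
# Only column 'a' is declared as a data column, so only 'a' may appear in where=.
df.to_hdf(path, key="df_key", format="table", data_columns=["a"])

# Rows are filtered by the where= expression; only columns 'a' and 'b' are returned.
subset = pd.read_hdf(path, key="df_key", columns=["a", "b"],
                     where="a > 0.5 and a < 0.75")

# The same selection done in memory, for comparison.
expected = df.loc[(df["a"] > 0.5) & (df["a"] < 0.75), ["a", "b"]]
```

Declaring only the filtered columns as data columns, instead of data_columns=True, should keep the file smaller and writes faster when only a few columns are ever used in conditions.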

UPDATE:

If you can't change the HDF5 file, then consider the following technique:

In [13]: df = pd.concat([x.query("0.5 < a < 0.75") 
         for x in pd.read_hdf('c:/temp/file.h5', 'df_key', chunksize=10)], 
         ignore_index=True) 

In [14]: df 
Out[14]: 
      a   b   c   d   e 
0 0.744123 0.515697 0.005335 0.017147 0.176254 
1 0.555202 0.074128 0.874943 0.660555 0.776340 
2 0.667145 0.278355 0.661728 0.705750 0.623682 
3 0.701163 0.429860 0.223079 0.735633 0.476182 
4 0.645130 0.302878 0.428298 0.969632 0.983690 
5 0.633334 0.898632 0.881866 0.228983 0.216519 
6 0.535633 0.906661 0.221823 0.608291 0.330101 
7 0.715708 0.478515 0.002676 0.231314 0.075967 
8 0.587762 0.262281 0.458854 0.811845 0.921100 
9 0.551251 0.537855 0.906546 0.169346 0.063612 
..  ...  ...  ...  ...  ... 
23 0.610958 0.874373 0.785681 0.147954 0.966443 
24 0.619666 0.818202 0.378740 0.416452 0.903129 
25 0.500782 0.536064 0.697678 0.654602 0.054445 
26 0.638659 0.518900 0.210444 0.308874 0.604929 
27 0.696883 0.601130 0.402640 0.150834 0.264218 
28 0.692149 0.963457 0.364050 0.152215 0.622544 
29 0.737854 0.055863 0.346940 0.003907 0.678405 
30 0.644924 0.840488 0.151190 0.566749 0.181861 
31 0.710590 0.900474 0.061603 0.144200 0.946062 
32 0.601144 0.288909 0.074561 0.615098 0.737097 

[33 rows x 5 columns] 
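
For a read-only file whose format is not known in advance, one option is to check the storage format first and pick a strategy accordingly. The sketch below uses HDFStore.get_storer and its is_table attribute; read_filtered is a hypothetical helper, not part of pandas, and the column names and temporary file exist only for the demonstration. It assumes the condition is valid both as a where= expression and as a DataFrame.query string, and that for a table store the columns in the condition were written as data columns. As far as I know, pandas may also reject chunksize= on a fixed-format store, in which case loading the group and filtering in memory is the remaining fallback.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def read_filtered(path, key, columns, condition):
    """Return `columns` of the rows matching `condition` from an HDF5 store.

    Hypothetical helper for illustration, not a pandas API.
    """
    with pd.HDFStore(path, mode="r") as store:
        if store.get_storer(key).is_table:
            # Table format: the filter and column selection happen while reading.
            return store.select(key, where=condition, columns=columns)
        # Fixed format: the group must be read in its entirety,
        # so filter and select columns in memory afterwards.
        return store.get(key).query(condition)[columns]


# Demonstration on a small fixed-format file (placeholder data and names).
demo_df = pd.DataFrame({
    "col1": np.arange(10.0),
    "col2": [0, 1] * 5,
    "col3": np.arange(10.0) * 2,
})
demo_path = os.path.join(tempfile.mkdtemp(), "fixed.h5")
demo_df.to_hdf(demo_path, key="grp", format="fixed")

result = read_filtered(demo_path, "grp", ["col1", "col2"], "col2 == 1")
```

The fixed-format branch still loads the whole group once, so it saves no memory during the read; it only keeps the filtered result afterwards.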

I only have read-only access to the HDF5 files, and I don't want to save them again because they are large. – Safariba


@Safariba, please check the update – MaxU