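Pandas HDFStore for out-of-core sequential read/write of variable-size sets

I want to read and write data to an HDF5 file incrementally, because the data does not fit in memory.

The data to read and write are sets of integers. I only need to read and write the sets sequentially; no random access is required. For example, I read set1, then set2, then set3, and so on.

The problem is that I cannot retrieve a set by its index.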

import pandas as pd  
x = pd.HDFStore('test.hf', 'w', append=True) 
a = pd.Series([1]) 
x.append('dframe', a, index=True) 
b = pd.Series([10,2]) 
x.append('dframe', b, index=True) 
x.close() 

x = pd.HDFStore('test.hf', 'r') 
print(x['dframe']) 
y=x.select('dframe',start=0,stop=1) 
print("selected:", y) 
x.close()

Output:

0  1 
0 10 
1  2 
dtype: int64 
selected: 0 1 
dtype: int64

It does not select my set 0, which is {1, 10}.

2017-03-25 dot dot dot

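'index=False' http://stackoverflow.com/questions/25714549/indexing-and-data-columns-in-pandas-pytables

You can simply do this: 'y = x.select('dframe', start=0, stop=1+1)' – MaxU

@MaxU But that would mean I know the set has two elements before I read it from the file, which is not the case. When I read the file, I do not know the size of each set.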
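The start and stop arguments of select() count rows, not index labels, which is why start=0, stop=1 returns only the first row. One way to read sets of unknown size in file order is to stream the rows and split them wherever the index label changes. The sketch below assumes that every row of a set carries that set's id as its index label and that the rows of one set are stored contiguously. The helper names iter_sets and iter_rows are made up for illustration, and iter_rows is not exercised here because it needs an open store.

```python
from itertools import groupby
from operator import itemgetter


def iter_sets(rows):
    """Yield (set_id, values) for each run of consecutive rows sharing an id.

    `rows` is any iterable of (set_id, value) pairs in file order.
    Assumption: the rows of one set are contiguous.
    """
    for set_id, run in groupby(rows, key=itemgetter(0)):
        yield set_id, {value for _, value in run}


def iter_rows(store, key, chunksize=100_000):
    """Stream (index_label, value) pairs from a table-format Series.

    Assumption: `store` is an open pandas HDFStore. With chunksize set,
    select() returns an iterator of chunks instead of the whole table.
    """
    for chunk in store.select(key, chunksize=chunksize):
        yield from chunk.items()


# Rows as they appear in the question's file: index labels 0, 0, 1.
rows = [(0, 1), (0, 10), (1, 2)]
result = list(iter_sets(rows))
print(result)
```

With a real file this would be called as iter_sets(iter_rows(x, 'dframe')). Because the chunks are flattened into one row stream before grouping, a set that straddles a chunk boundary is still returned in one piece, and no set size has to be known in advance.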

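It works this way, but I really do not know how fast it is.

Does this scan the whole file to find the rows with that index?

That would be quite a waste of time.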

import pandas as pd 

# HDFStore() itself takes no format argument; append() below writes table format.
x = pd.HDFStore('test.hf', 'w', append=True, complevel=9)
a = pd.Series([1]) 
x.append('dframe', a, index=True) 
b = pd.Series([10,2]) 
x.append('dframe', b, index=True) 
x.close() 

x = pd.HDFStore('test.hf', 'r') 
print(x['dframe']) 
y=x.select('dframe','index == 0') 
print('selected:') 
for i in y: 
    print(i) 
x.close()

Output:

0  1 
0 10 
1  2 
dtype: int64 
selected: 
1 
10

2017-03-25 14:07:06

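as was done using 'data_columns=True', is a correct approach, but you should also create the HDF store in 'table' format: 'pd.HDFStore('test.hf', mode='w', format='table', append=True)' – MaxU

You might want to check [this answer](http://stackoverflow.com/a/41555615/5741205) for some performance tests... – MaxU

@MaxU 755 ms per loop is really poor... I would have to run something like 759997 loops, and I only need to read the sets sequentially, not with random access. If I write my own code to save and read the data, it can be faster.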
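For reference, the "write my own code" idea from the last comment could look roughly like the sketch below: a plain binary file in which each set is stored as a length prefix followed by its values. The record layout (a 4-byte little-endian count, then that many 8-byte signed integers) and the names write_sets and read_sets are assumptions chosen for illustration, not anything pandas or HDF5 defines. Whether this is actually faster than HDFStore for a given workload would have to be measured.

```python
import os
import struct
import tempfile


def write_sets(path, sets):
    """Write each set as one record: uint32 count, then count int64 values."""
    with open(path, "wb") as f:
        for s in sets:
            values = sorted(s)
            f.write(struct.pack(f"<I{len(values)}q", len(values), *values))


def read_sets(path):
    """Yield the sets back one at a time, in the order they were written."""
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                return  # clean end of file
            if len(header) < 4:
                raise EOFError("truncated record header")
            (count,) = struct.unpack("<I", header)
            payload = f.read(8 * count)
            if len(payload) != 8 * count:
                raise EOFError("truncated record payload")
            yield set(struct.unpack(f"<{count}q", payload))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sets.bin")
    write_sets(path, [{1, 10}, {2}, set()])
    loaded = list(read_sets(path))

print(loaded)
```

The length prefix is what removes the need to know a set's size beforehand: the reader learns it from the file just before reading the values. The trade-off is that there is no compression, no metadata, and no random access, which matches the sequential-only requirement stated in the question.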
