Pandas sparse DataFrame larger on disk than dense version

I find that the sparse version of a DataFrame is actually much larger when saved to disk than the dense version. What am I doing wrong?

import pandas as pd
from numpy import ones, nan

test = pd.DataFrame(ones((4,4000)))
test.ix[:,:] = nan 
test.ix[0,0] = 47 

test.to_hdf('test3', 'df') 
test.to_sparse(fill_value=nan).to_hdf('test4', 'df') 

test.to_pickle('test5') 
test.to_sparse(fill_value=nan).to_pickle('test6') 

.... 
ls -sh test* 
200K test3 16M test4 164K test5 516K test6

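Using version 0.12.0.

I would ultimately like to efficiently store 10^7 by 60 arrays, with about 10% density, then pull them into pandas DataFrames and play with them.

Edit: Thanks to Jeff for answering the original question. Follow-up question: this only seems to give savings for pickling, not for other formats like HDF5. Is pickling my best route?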

print shape(array_activity) #This is just 0s and 1s 
(1020000, 60) 

test = pd.DataFrame(array_activity) 
test_sparse = test.to_sparse() 
print test_sparse.density 
0.0832333496732 

test.to_hdf('1', 'df') 
test_sparse.to_hdf('2', 'df') 
test.to_pickle('3') 
test_sparse.to_pickle('4') 
!ls -sh 1 2 3 4 
477M 1 544M 2 477M 3 83M 4

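This is data that, as a list of indices in a Matlab .mat file, is less than 12M. I was eager to get it into an HDF5/PyTables format so that I could grab just specific indices (other files are much larger and take much longer to load into memory) and then readily do Pandasy things to them. Perhaps I am not going about this the right way?

Source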

2014-02-06 jeffalstott

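Add a compression filter, see here: http://pandas.pydata.org/pandas-docs/dev/io.html#compression – Jeff

With a dense DataFrame and complevel=9 and complib='blosc', that drops us from 544M to 26M. Much better, but still not down to 12M. Trying compression with a sparse DataFrame throws a TypeError: 'TypeError: cannot properly create the storer for: [_TABLE_MAP] [group->/test_sparse (Group) '',value-><class 'pandas.sparse.frame.SparseDataFrame'>,table->True,append->True,kwargs->{'encoding': None}]' – jeffalstott

hmm.... that's not the right format; it should be saved with table=False; but that is also the default. Let me take a look. – Jeff

You are creating a frame that has 4000 columns and only 4 rows; sparse is dealt with row-wise, so reverse the dimensions.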

In [2]: from numpy import * 

In [3]: test = pd.DataFrame(ones((4000,4))) 

In [4]: test.ix[:,:] = nan 

In [5]: test.ix[0,0] = 47 

In [6]: test.to_hdf('test3', 'df') 

In [7]: test.to_sparse(fill_value=nan).to_hdf('test4', 'df') 

In [8]: test.to_pickle('test5') 

In [9]: test.to_sparse(fill_value=nan).to_pickle('test6') 

In [11]: !ls -sh test3 test4 test5 test6 
164K test3 148K test4 160K test5 36K test6
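To see why orientation matters, here is a minimal stdlib-only sketch. It assumes a sparse frame is kept as one sparse record per column, which is a simplified model in line with the SparseSeries description in this answer and not pandas' actual on-disk layout. The helpers `sparse_column` and `sparse_frame` are hypothetical names used only for illustration.

```python
import math
import pickle


def sparse_column(values):
    """Keep only the entries that are not NaN (NaN is the fill value)."""
    kept = [(i, v) for i, v in enumerate(values) if not math.isnan(v)]
    return {
        "length": len(values),
        "fill": math.nan,
        "index": [i for i, _ in kept],
        "values": [v for _, v in kept],
    }


def sparse_frame(n_rows, n_cols):
    """All-NaN frame with a single 47 at [0, 0], one sparse record per column."""
    frame = {}
    for c in range(n_cols):
        column = [math.nan] * n_rows
        if c == 0:
            column[0] = 47.0
        frame[c] = sparse_column(column)
    return frame


# Same 16000 cells, two orientations.
wide = pickle.dumps(sparse_frame(n_rows=4, n_cols=4000))
tall = pickle.dumps(sparse_frame(n_rows=4000, n_cols=4))

print(len(wide), len(tall))
```

In this model the wide frame pays the per-column bookkeeping 4000 times while the tall frame pays it 4 times, so the wide pickle comes out far larger even though both hold a single non-NaN value.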

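Follow-up. The store you provided was written in table format, and as a result saved the dense version (sparse is not supported for table format, which is very flexible and queryable; see the docs).

Furthermore, you may want to experiment with saving your file using the 2 different representations of the sparse format.

So, here's a sample session: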

In [1]: df = pd.read_hdf('store_compressed.h5','test') 

In [2]: type(df) 
Out[2]: pandas.core.frame.DataFrame 

In [3]: df.to_sparse(kind='block').to_hdf('test_block.h5','test',mode='w',complib='blosc',complevel=9) 

In [4]: df.to_sparse(kind='integer').to_hdf('test_block.h5','test',mode='w',complib='blosc',complevel=9) 

In [5]: df.to_sparse(kind='block').to_hdf('test_block.h5','test',mode='w',complib='blosc',complevel=9) 

In [6]: df.to_sparse(kind='integer').to_hdf('test_integer.h5','test',mode='w',complib='blosc',complevel=9) 

In [7]: df.to_hdf('test_dense_fixed.h5','test',mode='w',complib='blosc',complevel=9) 

In [8]: df.to_hdf('test_dense_table.h5','test',mode='w',format='table',complib='blosc',complevel=9) 

In [9]: !ls -ltr *.h5 
-rwxrwxr-x 1 jreback users 57015522 Feb 6 18:19 store_compressed.h5 
-rw-rw-r-- 1 jreback users 30335044 Feb 6 19:01 test_block.h5 
-rw-rw-r-- 1 jreback users 28547220 Feb 6 19:02 test_integer.h5 
-rw-rw-r-- 1 jreback users 44540381 Feb 6 19:02 test_dense_fixed.h5 
-rw-rw-r-- 1 jreback users 57744418 Feb 6 19:03 test_dense_table.h5

IIRC there is a bug in 0.12 where to_hdf doesn't pass all the arguments through, so you probably want to use:

with get_store('test.h5',mode='w',complib='blosc',complevel=9) as store: 
    store.put('test',df)

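These are basically stored as a collection of SparseSeries, so if the density is low and non-contiguous, the result will not be as small as it could be size-wise. The pandas sparse suite deals better with a smaller number of contiguous blocks, though YMMV. scipy provides some sparse handling tools as well.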
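The point about contiguous blocks can be illustrated with a small sketch of the two index styles named in the session above (kind='block' and kind='integer'). This is a simplified pure-Python model, not pandas' implementation: it assumes a block index keeps a start and a length per contiguous run, while an integer index keeps every non-fill position. The function names are hypothetical.

```python
def integer_index(mask):
    """'integer'-style index: one stored position per non-fill entry."""
    return [i for i, present in enumerate(mask) if present]


def block_index(mask):
    """'block'-style index: a (start, length) pair per contiguous run."""
    starts, lengths = [], []
    run_start = None
    for i, present in enumerate(mask):
        if present and run_start is None:
            run_start = i
        elif not present and run_start is not None:
            starts.append(run_start)
            lengths.append(i - run_start)
            run_start = None
    if run_start is not None:  # the last run reaches the end of the mask
        starts.append(run_start)
        lengths.append(len(mask) - run_start)
    return starts, lengths


def stored_ints(mask):
    """Number of integers each representation has to keep."""
    starts, lengths = block_index(mask)
    return {"integer": len(integer_index(mask)),
            "block": len(starts) + len(lengths)}


contiguous = [100 <= i < 200 for i in range(1000)]  # one run of 100 entries
scattered = [i % 10 == 0 for i in range(1000)]      # 100 isolated entries

print(stored_ints(contiguous))
print(stored_ints(scattered))
```

For the single run of 100 entries the block form keeps 2 integers against 100 for the integer form; for 100 isolated entries it keeps 200 against 100. Scattered 0/1 data like the array in the question is closer to the second case.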

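Though IMHO these are pretty trivial sizes for HDF5 files anyhow; you can handle gigantic numbers of rows, and file sizes into the 10's and 100's of gigabytes can easily be handled (though recommend).

Furthermore, you might consider using a table format if this is indeed a lookup table, since you can then query it.

Source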
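The question mentions that the same data fits in under 12M as lists of indices. One possible reading of the lookup-table suggestion is a long layout with one (row, col) record per 1 in the array. The sketch below is a pure-Python illustration of that layout under that assumption; the helper names are hypothetical, and the list comprehension in `select_rows` only stands in for what would be a query against a stored table.

```python
def to_records(dense):
    """One (row, col) record per 1 in a 0/1 array."""
    return [(r, c)
            for r, row in enumerate(dense)
            for c, v in enumerate(row) if v]


def select_rows(records, wanted_rows):
    """Stand-in for a table query: keep records whose row is wanted."""
    wanted = set(wanted_rows)
    return [(r, c) for r, c in records if r in wanted]


def to_dense_rows(records, wanted_rows, n_cols):
    """Rebuild dense 0/1 rows for just the requested row numbers."""
    rows = {r: [0] * n_cols for r in wanted_rows}
    for r, c in select_rows(records, wanted_rows):
        rows[r][c] = 1
    return rows


dense = [
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]
records = to_records(dense)
print(records)
print(to_dense_rows(records, [2], n_cols=4))
```

Storage in this layout grows with the number of 1s, not with rows times columns, and only the requested rows are ever expanded back to dense form.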

2014-02-06 18:29:58 Jeff
