I am working with a system that operates on large (> 5 GB) .csv files. To improve performance, I am testing (a) different ways of creating a dataframe from disk (pandas vs. dask) and (b) different ways of storing the results to disk (.csv vs. HDF5). Why do both pandas and dask perform better when importing from CSV than from HDF5?
To benchmark performance, I did the following:
import gc
import pandas as pd
import dask.dataframe as dd

# results_path points to the large .csv file

def dask_read_from_hdf():
    # read_hdf opens and closes the file itself when given a path,
    # so no explicit close() is needed here
    results_dd_hdf = dd.read_hdf('store.h5', key='period1', columns=['Security'])
    analyzed_stocks_dd_hdf = results_dd_hdf.Security.unique()

def pandas_read_from_hdf():
    results_pd_hdf = pd.read_hdf('store.h5', key='period1', columns=['Security'])
    analyzed_stocks_pd_hdf = results_pd_hdf.Security.unique()

def dask_read_from_csv():
    results_dd_csv = dd.read_csv(results_path, sep=',', usecols=[0], header=1, names=['Security'])
    analyzed_stocks_dd_csv = results_dd_csv.Security.unique()

def pandas_read_from_csv():
    results_pd_csv = pd.read_csv(results_path, sep=',', usecols=[0], header=1, names=['Security'])
    analyzed_stocks_pd_csv = results_pd_csv.Security.unique()

print("dask hdf performance")
%timeit dask_read_from_hdf()
gc.collect()
print("")
print("pandas hdf performance")
%timeit pandas_read_from_hdf()
gc.collect()
print("")
print("dask csv performance")
%timeit dask_read_from_csv()
gc.collect()
print("")
print("pandas csv performance")
%timeit pandas_read_from_csv()
gc.collect()
My findings were:
dask hdf performance
10 loops, best of 3: 133 ms per loop
pandas hdf performance
1 loop, best of 3: 1.42 s per loop
dask csv performance
1 loop, best of 3: 7.88 ms per loop
pandas csv performance
1 loop, best of 3: 827 ms per loop
If an HDF5 store offers faster access than .csv, and dask builds dataframes faster than pandas, why is dask reading from HDF5 slower than dask reading from CSV? Am I doing something wrong?
When does creating a dask dataframe from an HDF5 store make sense for performance?
A big +1 for using parquet – MRocklin