Speeding up CSV to HDF5 conversion in pandas

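I am looking for a way to speed up this process. I have it working, but it takes days to complete.
I have one data file for every day of a year, and I want to merge them into a single HDF5 file with one node per data tag.
The data looks like this: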

a,1468004920,986.078 
a,1468004921,986.078 
a,1468004922,987.078 
a,1468004923,986.178 
a,1468004924,984.078 
b,1468004920,986.078 
b,1468004924,986.078 
b,1468004928,987.078 
c,1468004924,98.608 
c,1468004928,97.078 
c,1468004932,98.078

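Note that each data tag has a different number of entries and a different update frequency. Each daily file holds about 4 million rows and about 4,000 distinct tags, and I have a year of data.
The code below does what I want, but running it on every file takes days to complete. I am looking for suggestions to speed it up: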

import pandas as pd
import pytz

MI = pytz.timezone('US/Central')

def readFile(file_name):
    tmp_data = pd.read_csv(file_name, index_col=[1], names=['Tag', 'Timestamp', 'Value'])
    # epoch seconds are UTC, so parse them as UTC and then convert to Central time
    tmp_data.index = pd.to_datetime(tmp_data.index, unit='s', utc=True).tz_convert(MI)
    tmp_data['Tag'] = tmp_data['Tag'].astype('category')
    tag_names = tmp_data.Tag.unique()
    # write each tag's values to its own node in the HDF5 file
    for name in tag_names:
        tmp_data.loc[tmp_data.Tag == name].Value.to_hdf('test.h5', key=name, complevel=9, complib='blosc', format='table', append=True)

for name in ['test1.csv']:
    readFile(name)

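Essentially, what I am trying to do is "unpack" the CSV data so that each tag is stored separately in the HDF5 file. I want all the data tagged "a" to go into a single leaf of the year's HDF5 file, all the "b" data into the next leaf, and so on. So I need to run the code above on each of the 365 files. I have tried it without compression, and I have also tried index=False, but neither seemed to make much difference.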
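The "unpacking" described above can be illustrated in memory with a single `groupby` pass instead of one boolean filter per tag. This is only a minimal sketch on the sample rows from the question; the helper name `split_by_tag` is hypothetical, and nothing is written to HDF5 here.

```python
import io

import pandas as pd

# The sample rows shown in the question.
SAMPLE = """a,1468004920,986.078
a,1468004921,986.078
a,1468004922,987.078
a,1468004923,986.178
a,1468004924,984.078
b,1468004920,986.078
b,1468004924,986.078
b,1468004928,987.078
c,1468004924,98.608
c,1468004928,97.078
c,1468004932,98.078
"""


def split_by_tag(csv_source):
    """Return {tag: Series of values indexed by timestamp} (hypothetical helper)."""
    df = pd.read_csv(csv_source, names=["Tag", "Timestamp", "Value"],
                     index_col="Timestamp")
    # Epoch seconds are UTC; keep them timezone-aware.
    df.index = pd.to_datetime(df.index, unit="s", utc=True)
    # One pass over the frame; each group is the data for one tag.
    return {tag: values for tag, values in df.groupby("Tag")["Value"]}


per_tag = split_by_tag(io.StringIO(SAMPLE))
for tag, series in per_tag.items():
    print(tag, len(series))
```

Each entry of `per_tag` corresponds to what would become one leaf in the HDF5 file, and the differing lengths reflect the differing update frequencies of the tags.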

Source

2016-08-04 Adam

I would do it this way:

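import pandas as pd
import pytz

MI = pytz.timezone('US/Central')
hdf_key = 'my_key'
csv_files = ['test1.txt']  # list all your CSV files here

store = pd.HDFStore('/path/to/file.h5')

# loop which processes all your CSV files
for file_name in csv_files:
    tmp_data = pd.read_csv(file_name, index_col=[1], names=['Tag', 'Timestamp', 'Value'])
    # epoch seconds are UTC, so parse them as UTC and then convert to Central time
    tmp_data.index = pd.to_datetime(tmp_data.index, unit='s', utc=True).tz_convert(MI)
    # pay attention to index=False - we want to index everything at the end
    # data_columns makes 'Tag' and 'Value' indexable later
    store.append(hdf_key, tmp_data, data_columns=['Tag', 'Value'], complevel=9, complib='blosc', append=True, index=False)

# all CSV files have been processed, let's index everything...
store.create_table_index(hdf_key, columns=['Tag', 'Value'], optlevel=9, kind='full')
store.close()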

Source

2016-08-04 16:32:40 MaxU

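I think your suggestion is to use index=False. I have tried that and it did not change much. However, your code removes the important part of the loop. tmp_data is my CSV data for each file; in my actual code, "test.txt" is really a variable that holds the file name. I then put each tag into a different node. Your code puts the whole file into a single node. – Adam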

To clarify, I just edited my code to use a function. The question is how to make readFile faster. – Adam
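One possible direction for a faster readFile that still writes one node per tag is to open the HDF5 file once, split each day's frame with a single `groupby`, and build the indexes only after all files are appended. The following is a rough sketch under those assumptions, not a measured result; the function names `read_day`, `append_by_tag` and `convert_files` are hypothetical, and the compression settings are simply the ones from the question.

```python
import pandas as pd


def read_day(file_name, tz="US/Central"):
    """Load one daily CSV with a timezone-aware timestamp index."""
    df = pd.read_csv(file_name, names=["Tag", "Timestamp", "Value"],
                     index_col="Timestamp")
    # Epoch seconds are UTC; convert to the local zone afterwards.
    df.index = pd.to_datetime(df.index, unit="s", utc=True).tz_convert(tz)
    return df


def append_by_tag(store, df):
    """Append each tag's values to its own node of an already open store."""
    n_tags = 0
    # sort=False keeps the tags in order of first appearance.
    for tag, values in df.groupby("Tag", sort=False)["Value"]:
        # index=False defers index creation until all files are written.
        store.append(tag, values, index=False)
        n_tags += 1
    return n_tags


def convert_files(csv_files, h5_path):
    """Append every CSV file to h5_path, then index each node once."""
    with pd.HDFStore(h5_path, mode="a", complevel=9, complib="blosc") as store:
        for file_name in csv_files:
            append_by_tag(store, read_day(file_name))
        for key in store.keys():
            store.create_table_index(key, optlevel=9, kind="full")
```

A call such as `convert_files(['test1.csv'], 'test.h5')` would process the file from the question. Compared with calling `to_hdf` once per tag, this avoids reopening the file about 4,000 times per day and avoids scanning the whole frame once per tag; whether that is enough for a full year of data would need to be timed.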
