2016-11-22 162 views

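Adding a column to a multi-indexed pandas DataFrame

I am reading several large (~700 MB) CSV files into DataFrames, which will all be combined into a single CSV. Right now each CSV is indexed by its date column. All of the CSVs have overlapping dates but unique test locations. Each CSV is named after its test location (for example, BER.csv and ALT.csv for the BER and ALT test sites). How can I build a MultiIndex from the date and the test site? Right now I have: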

import os
import time

import pandas as pd

def openFile(filesToProcess):
    df1 = pd.DataFrame()
    counter = 0
    for path in filesToProcess:
        # The test site name is the file name without its extension
        base = os.path.splitext(os.path.basename(path))[0]
        print("Working on %s" % base)
        with open(path, 'r') as input_file:
            # row_count = sum(1 for row in input_file)
            if counter == 0:
                df1 = createDataFrame(input_file)
            else:
                df2 = createDataFrame(input_file)
                df1 = pd.concat([df1, df2])
        counter += 1
    df1.to_csv('large.csv')

def createDataFrame(input_file):
    checkTime = time.perf_counter()
    # print("Start DataFrame")
    df1 = pd.read_csv(input_file,
                      sep=",",
                      nrows=500,
                      index_col=['Date'])
    # print("End DataFrame")
    # print("Ran for " + str(time.perf_counter() - checkTime) + " Seconds")
    return df1

So, for example, I would like:

date, testsite, data1, data2 
1/1/1992 9:15:00, ber, 89, 200 
1/1/1992 9:17:00, ber, 54, 103.3 
1/1/1992 9:15:00, alt, 90, 109.23 
1/1/1992 9:17:00, alt, 12, 110.1 

where date and testsite together form the MultiIndex.

Answer


Setup

import pandas as pd

# Both sites share the same two timestamps
dates = ['1/1/1992 9:15:00', '1/1/1992 9:17:00']

ber_df = pd.DataFrame([[89, 200], [54, 103.3]],
                      index=pd.DatetimeIndex(dates, name='date'),
                      columns=['data1', 'data2'])

alt_df = pd.DataFrame([[90, 109.23], [12, 110.1]],
                      index=pd.DatetimeIndex(dates, name='date'),
                      columns=['data1', 'data2'])

ber_df.to_csv('ber.csv')

alt_df.to_csv('alt.csv')

Solution

filesToProcess = ['ber.csv', 'alt.csv']

def parse_file(fn):
    return pd.read_csv(fn, index_col=0, parse_dates=[0])

# The dict keys (file names without '.csv') become the outer index level
(pd.concat({fn.replace('.csv', ''): parse_file(fn) for fn in filesToProcess})
   .rename_axis(['testsite', 'date'], axis=0)
   .swaplevel(0, 1)
   .reset_index())
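The solution above ends with `reset_index()`, so date and testsite come back as ordinary columns. If the goal is to keep them as a MultiIndex, as the question asks, one possible variant is to name the levels directly in `pd.concat` and skip the reset. This is a minimal sketch rather than part of the original answer: `combine_sites` is a hypothetical helper name, and the two in-memory frames stand in for the parsed CSV files.

```python
import pandas as pd

def combine_sites(frames):
    """Stack per-site frames into one frame indexed by (date, testsite).

    `frames` maps a site name to a DataFrame indexed by date.
    Hypothetical helper, not taken from the original answer.
    """
    # The dict keys become the outer index level, named 'testsite' here
    combined = pd.concat(frames, names=['testsite', 'date'])
    # Move 'date' to the front so the index reads (date, testsite)
    return combined.swaplevel('testsite', 'date')

dates = pd.DatetimeIndex(['1/1/1992 9:15:00', '1/1/1992 9:17:00'], name='date')
frames = {
    'ber': pd.DataFrame([[89, 200], [54, 103.3]],
                        index=dates, columns=['data1', 'data2']),
    'alt': pd.DataFrame([[90, 109.23], [12, 110.1]],
                        index=dates, columns=['data1', 'data2']),
}

result = combine_sites(frames)
print(result)
```

Calling `result.to_csv('large.csv')` would then write both index levels as the first two columns of the combined file.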

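Since the question mentions files of roughly 700 MB each, concatenating everything in memory may not always be practical. One alternative is to stream each file in chunks, tag every chunk with its test site, and append it to the output CSV. The sketch below is an assumption-laden illustration, not the answer's method: `combine_csvs` is a hypothetical name, the `Date` column name is taken from the question's `index_col`, the chunk size is arbitrary, and the tiny temporary files only stand in for the real data.

```python
import os
import tempfile

import pandas as pd

def combine_csvs(paths, out_path, date_col='Date', chunksize=100_000):
    """Append each site's rows to out_path, tagged with the site name.

    Hypothetical helper; the site name is the file name without extension.
    """
    first = True
    for path in paths:
        site = os.path.splitext(os.path.basename(path))[0]
        for chunk in pd.read_csv(path, parse_dates=[date_col],
                                 chunksize=chunksize):
            # Put the testsite column right after the date column
            pos = chunk.columns.get_loc(date_col) + 1
            chunk.insert(pos, 'testsite', site)
            # Write the header only once, then append
            chunk.to_csv(out_path, mode='w' if first else 'a',
                         header=first, index=False)
            first = False

# Tiny demo: two throwaway files stand in for the large CSVs
tmp = tempfile.mkdtemp()
dates = pd.DatetimeIndex(['1/1/1992 9:15:00', '1/1/1992 9:17:00'], name='Date')
samples = {'ber': [(89, 200), (54, 103.3)], 'alt': [(90, 109.23), (12, 110.1)]}
for site, rows in samples.items():
    pd.DataFrame(rows, index=dates, columns=['data1', 'data2']) \
        .to_csv(os.path.join(tmp, site + '.csv'))

out = os.path.join(tmp, 'large.csv')
combine_csvs([os.path.join(tmp, 'ber.csv'), os.path.join(tmp, 'alt.csv')],
             out, chunksize=1)

# Reading the result back and rebuilding the MultiIndex
combined = pd.read_csv(out, parse_dates=['Date']).set_index(['Date', 'testsite'])
print(combined)
```

Only one chunk is held in memory at a time, at the cost of the output being grouped by site rather than sorted by date.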
