2015-06-09 41 views

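How can I import a large (5.5 GB) CSV file into PostgreSQL using Ruby on Rails? I have a huge CSV file, 5.5 GB in size, with more than 100 columns. I want to import only specific columns from it. What are the possible ways to do this?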

I want to import it into two different tables: a single field goes into one table, and the remaining fields go into the other.

Should I use the COPY command in PostgreSQL, the CSV class, or the SmartCSV class for this?

Regards, Suresh.

You can use either the CSV class or SmartCSV, but I would suggest running the import as a background job. You could try delayed_job.

With Sidekiq you can run your import code in the background. Sidekiq lets you run custom methods as background jobs.

This will give you more explanation: http://stackoverflow.com/questions/23140008/upload-a-csv-file-import-and-process-in-background

Answer

If I had a 5 GB CSV, I would rather import it without Rails! But you may have a use case that needs Rails...

Since you said Rails, I suppose you are talking about a web request and ActiveRecord...

If you don't mind waiting (and tying up one instance of your server process), you can do this:

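Before that, note two things: 1) use a temp table, so that if there are errors you don't mess up your destination table (this is optional, of course); 2) use an option to truncate the destination table first.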

Controller action:

def updateDB
  remote_file = params[:remote_file] # an ActionDispatch::Http::UploadedFile
  truncate = params[:truncate] == 'true'
  if remote_file
    result = Model.csv2tempTable(remote_file.original_filename, remote_file.tempfile)
    if !result[:result]
      # csv2tempTable returns its messages under the :erros key
      flash[:error] = 'Errors: ' + result[:erros].join(' ==> ')
    elsif Model.updateFromTempTable(truncate)
      flash[:notice] = 'Success.'
    else
      # updateFromTempTable returns false when any statement fails
      flash[:error] = 'Error: could not update the table from the temp table.'
    end
  else
    flash[:error] = 'Error: no file given.'
  end
  redirect_to somewhere_else_path
end

Model methods:

# References: 
# http://www.kadrmasconcepts.com/blog/2013/12/15/copy-millions-of-rows-to-postgresql-with-rails/ 
# http://stackoverflow.com/questions/14526489/using-copy-from-in-a-rails-app-on-heroku-with-the-postgresql-backend 
# http://www.postgresql.org/docs/9.1/static/sql-copy.html 
# 
# TEMP_TABLE is a constant assumed to be defined elsewhere in the model (not shown here).
def self.csv2tempTable(uploaded_name, uploaded_file)
  erros = []
  begin
    # read csv file
    file = uploaded_file
    Rails.logger.info "Creating temp table...\n From: #{uploaded_name}\n "
    # init connection
    conn = ActiveRecord::Base.connection
    rc = conn.raw_connection
    # recreate the temp table with the same columns as the destination table
    rc.exec "drop table IF EXISTS #{TEMP_TABLE}; "
    rc.exec "create table #{TEMP_TABLE} (like #{self.table_name}); "
    # remove columns created_at/updated_at
    rc.exec "alter table #{TEMP_TABLE} drop column created_at, drop column updated_at;"
    # copy it!
    rc.exec("COPY #{TEMP_TABLE} FROM STDIN WITH CSV HEADER")

    until file.eof?
      # Add row to copy data
      l = file.readline
      if l.encoding.name != 'UTF-8'
        Rails.logger.info "line encoding is #{l.encoding.name}..."
        # ENCODING:
        # If the source string is already encoded in UTF-8, then just calling .encode('UTF-8') is a no-op,
        # and no checks are run. However, converting it to UTF-16 first forces all the checks for invalid byte
        # sequences to be run, and replacements are done as needed.
        # Reference: http://stackoverflow.com/questions/2982677/ruby-1-9-invalid-byte-sequence-in-utf-8?rq=1
        l = l.encode('UTF-16', 'UTF-8').encode('UTF-8', 'UTF-16')
      end
      # debug level: logging every line of a multi-gigabyte file at info level would flood the log
      Rails.logger.debug "writing line with encoding #{l.encoding.name} => #{l[0..80]}"
      rc.put_copy_data(l)
    end
    # We are done adding copy data
    rc.put_copy_end
    # Collect any error messages
    while res = rc.get_result
      e_message = res.error_message
      if e_message.present?
        erros << "Error executing SQL: \n" + e_message
      end
    end
  rescue StandardError => e
    erros << "Error in csv2tempTable: \n #{e} => #{e.to_yaml}"
  end
  if erros.present?
    Rails.logger.error erros.join("*******************************\n")
    { result: false, erros: erros }
  else
    { result: true, erros: [] }
  end
end
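
On Ruby 2.1 or later, String#scrub is another way to get rid of invalid byte sequences without the UTF-16 round trip. This is only a minimal sketch, assuming the uploaded bytes are meant to be UTF-8; the helper name to_valid_utf8 is hypothetical and is not part of the code above.

```ruby
# Sketch: return a valid UTF-8 copy of a line read from an uploaded file.
# Assumption: the bytes are intended to be UTF-8. Each invalid byte sequence
# is replaced with the given replacement string.
def to_valid_utf8(line, replacement = '?')
  # dup so the caller's string (possibly frozen or binary) is left untouched
  utf8 = line.dup.force_encoding(Encoding::UTF_8)
  utf8.valid_encoding? ? utf8 : utf8.scrub(replacement)
end
```

Inside the read loop, the line would be passed through this helper before rc.put_copy_data.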

# copy from TEMP_TABLE into self.table_name
# If <truncate> = true, truncates self.table_name first
# If <truncate> = false, replaces the rows of self.table_name that are present in TEMP_TABLE
# Returns true on success, false on any error.
# check_exec(rc) is a helper that is not shown here; it is expected to
# return false when the last statement failed.
#
def self.updateFromTempTable(truncate)
  begin
    Rails.logger.info "Refreshing table #{self.table_name}...\n Truncate: #{truncate}\n "
    # init connection
    conn = ActiveRecord::Base.connection
    rc = conn.raw_connection
    if truncate
      rc.exec "TRUNCATE TABLE #{self.table_name}"
      return false unless check_exec(rc)
      rc.exec "INSERT INTO #{self.table_name} SELECT *, '#{DateTime.now}' as created_at, '#{DateTime.now}' as updated_at FROM #{TEMP_TABLE}"
      return false unless check_exec(rc)
    else
      # remove lines from self.table_name that are present in temp
      rc.exec "DELETE FROM #{self.table_name} WHERE id IN (SELECT id FROM #{TEMP_TABLE})"
      return false unless check_exec(rc)
      # copy lines from temp into self + includes timestamps
      rc.exec "INSERT INTO #{self.table_name} SELECT *, '#{DateTime.now}' as created_at, '#{DateTime.now}' as updated_at FROM #{TEMP_TABLE};"
      return false unless check_exec(rc)
    end
  rescue StandardError => e
    Rails.logger.error "Error in updateFromTempTable: \n #{e} => #{e.to_yaml}"
    return false
  end

  true
end
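
The code above copies every column of the file into one table. For the case in the question (only some columns, split across two tables), the rows could be filtered while streaming, before they reach COPY. A minimal sketch using Ruby's standard CSV class; KEY_COLUMN, OTHER_COLUMNS and each_split_row are hypothetical names chosen for illustration, so substitute your own headers.

```ruby
require 'csv'

# Sketch: stream a CSV and split each row into COPY-ready lines for two tables.
# KEY_COLUMN and OTHER_COLUMNS are hypothetical header names, not taken from the question.
KEY_COLUMN = 'code'
OTHER_COLUMNS = %w[name city].freeze

# Yields [line_for_first_table, line_for_second_table] for every data row.
# `io` is any IO-like object (File, Tempfile, StringIO). Rows are parsed one
# at a time, so the whole file is never held in memory.
def each_split_row(io)
  return enum_for(:each_split_row, io) unless block_given?

  CSV.new(io, headers: true).each do |row|
    # The single field destined for the first table
    first = CSV.generate_line([row[KEY_COLUMN]])
    # The selected remaining fields for the second table; other columns are dropped
    second = CSV.generate_line(row.values_at(*OTHER_COLUMNS))
    yield first, second
  end
end
```

Each generated line can then be sent with rc.put_copy_data under a statement such as COPY some_table (col_a, col_b) FROM STDIN WITH CSV, which names the target columns explicitly. A single connection runs only one COPY at a time, so loading two tables would need either two passes over the file or two connections.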