
Could really use some help here. Struggling to display a dashboard built on a large data set with Ruby.

On average it takes ~2 seconds to process ~2k records.

A query in the MySQL console returns 150k rows in under 3.5 seconds. The same query in Ruby takes over 4 minutes before all the objects are ready.

Goal: optimize the data handling further before adding a caching server. Working with Ruby 1.9.2, Rails 3.0, and MySQL (the mysql2 gem).
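For reference, a rough way to see where the time goes (raw query vs. ActiveRecord object instantiation) is something like the following, run from the Rails console. This is only a sketch using the standard Benchmark module; the SQL and the 6-month window mirror the sample further down.

require 'benchmark'

sql = "SELECT * FROM fill_ups WHERE created_at >= NOW() - INTERVAL 6 MONTH"

raw_time = Benchmark.realtime { ActiveRecord::Base.connection.select_all(sql) } # rows as plain hashes
ar_time  = Benchmark.realtime { FillUp.find_by_sql(sql) }                       # rows as ActiveRecord objects

puts "raw rows: #{'%.1f' % raw_time}s, ActiveRecord objects: #{'%.1f' % ar_time}s"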

Questions:

  • Does working with hashes hurt performance?
  • Should I load everything into one master hash first and then process the data I need from it?
  • Is there anything else I can do to improve performance?

Rows in the DB:

  • GasStations and USCensus each have ~150k records
  • People has ~100k records
  • Cars has ~200k records
  • FillUps has ~2.3 million records

Dashboard requirements (queried over time periods such as the last 24 hours, last week, etc.). All data is returned as JSON.

  • Gas stations, with FillUps and USCensus data (zip, name, city, population)
  • Top 20 cities with the most fill ups
  • Top 10 cars with the most fill ups
  • Cars grouped by how many times they filled up their tanks

Code (a 6-month sample returns ~100k+ records):

# For simplicity I removed the select clause I had; it drops columns I don't need (updated_at, gas_station.created_at, etc.) instead of returning every column of each table.
@primary_data = FillUp.includes(:car, :gas_station => :uscensus).where('fill_ups.created_at >= ?', 6.months.ago) # This took 4+ minutes

# then tried 

@primary_data = FillUp.find_by_sql('some long sql query...') # took longer than before. 
# Note for others: the SQL query did some pre-processing for me, which added extra attributes to the result. The query in the DB console took < 4 seconds, but because of those extra attributes the Ruby call took longer, as if Ruby was checking each row to map the attributes.

# then tried 

# as seen at http://stackoverflow.com/questions/4456834/ruby-on-rails-storing-and-accessing-large-data-sets
MY_MAP = Hash[ActiveRecord::Base.connection.select_all('SELECT thingone, thingtwo from table').map { |one| [one['thingone'], one['thingtwo']] }]
# That took 23 seconds and also gave me the mapping of the additional data that was being processed later, so much faster.

# Currently using the code below, which takes ~10 seconds.
# Although this is faster, the query itself still only takes 3.5 seconds; parsing the results into the hashes adds the overhead.
cars = {}
gasstations = {}
cities = {}
filled = {}

client = Mysql2::Client.new(:host => "localhost", :username => "root")
client.query("SELECT sum(fill_ups_grouped_by_car_id) as filled, fillups.car_id, cars.make as make, gasstations.name as name, ....", :stream => true, :as => :json).each do |row|
  # this returns fill ups grouped by car: fill_ups.car_id, car make, gas station name, gas station zip, gas station city, city population
  if cities[row['city']]
    cities[row['city']]['fill_ups'] += row['filled']
  else
    cities[row['city']] = { 'fill_ups' => row['filled'], 'population' => row['population'] }
  end
  if gasstations[row['name']]
    gasstations[row['name']]['fill_ups'] += row['filled']
  else
    gasstations[row['name']] = { 'city' => row['city'], 'zip' => row['zip'], 'fill_ups' => row['filled'] }
  end
  if cars[row['make']]
    cars[row['make']] += row['filled']
  else
    cars[row['make']] = row['filled']
  end
  if filled[row['filled']]
    filled[row['filled']] += 1
  else
    filled[row['filled']] = 1
  end
end
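As a side note on the hash handling itself, the branching above can be flattened by giving the hashes default values. This is only a sketch of the same aggregation, not a measured improvement; sql stands in for the same (elided) SELECT string used above, and client is the same Mysql2 client.

cars        = Hash.new(0)
filled      = Hash.new(0)
cities      = Hash.new { |h, k| h[k] = { 'fill_ups' => 0 } }
gasstations = Hash.new { |h, k| h[k] = { 'fill_ups' => 0 } }

client.query(sql, :stream => true, :as => :json).each do |row|
  # default blocks create the nested hashes on first access, so no if/else needed
  city = cities[row['city']]
  city['population'] ||= row['population']
  city['fill_ups']    += row['filled']

  station = gasstations[row['name']]
  station['city'] ||= row['city']
  station['zip']  ||= row['zip']
  station['fill_ups'] += row['filled']

  cars[row['make']]     += row['filled']
  filled[row['filled']] += 1
end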

With the following models:

class Person < ActiveRecord::Base
  has_many :cars
end

class Car < ActiveRecord::Base
  belongs_to :person
  belongs_to :uscensus, :foreign_key => :zipcode, :primary_key => :zipcode
  has_many :fill_ups
  has_many :gas_stations, :through => :fill_ups
end

class GasStation < ActiveRecord::Base
  belongs_to :uscensus, :foreign_key => :zipcode, :primary_key => :zipcode
  has_many :fill_ups
  has_many :cars, :through => :fill_ups
end

class FillUp < ActiveRecord::Base
  # log of every time a person fills up their gas
  belongs_to :car
  belongs_to :gas_station
end

class Uscensus < ActiveRecord::Base
  # basic data about an area, based on zip code
end

A small thing on the names chosen here: 'Car', 'GasStation', 'FillUp'. IMHO, the better Ruby convention is 'car', 'gas_station', 'fill_up' – Zabba 2012-07-07 18:18:47


Do you need to return the entire data set up front? Maybe something like http://apidock.com/rails/ActiveRecord/Batches/ClassMethods/find_in_batches – engineerDave 2012-07-07 18:33:53
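For reference, a minimal sketch of that suggestion (the batch size is an arbitrary assumption, and the scope mirrors the 6-month sample above):

total_by_car = Hash.new(0)

FillUp.where('created_at >= ?', 6.months.ago).find_in_batches(:batch_size => 5000) do |batch|
  # each batch instantiates at most 5,000 FillUp objects instead of 100k+ at once
  batch.each { |fill_up| total_by_car[fill_up.car_id] += 1 }
end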


No, could do find_in_batches, but the Ruby side is still very slow. Will try it and report back with results. – pcasa 2012-07-07 18:41:07

Answer


I don't use RoR, but returning 100k rows for a dashboard is never going to be very fast. I strongly suggest building or maintaining summary tables and running GROUP BYs in the database to aggregate the data set before presentation.
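For illustration, a GROUP BY along those lines could produce, say, the top-20-cities report directly in MySQL. This is only a sketch; the join and table names (fill_ups, gas_stations, uscensuses) are assumed from the models in the question.

top_cities = ActiveRecord::Base.connection.select_all(<<-SQL)
  SELECT uscensuses.city, uscensuses.population, COUNT(*) AS fill_ups
  FROM fill_ups
  JOIN gas_stations ON gas_stations.id = fill_ups.gas_station_id
  JOIN uscensuses ON uscensuses.zipcode = gas_stations.zipcode
  WHERE fill_ups.created_at >= NOW() - INTERVAL 6 MONTH
  GROUP BY uscensuses.city, uscensuses.population
  ORDER BY fill_ups DESC
  LIMIT 20
SQL
# => 20 small hashes instead of 100k+ rows shipped to Ruby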


We tried "rollup" tables, but since we write a big chunk of new data every 15 minutes, all of the rollups would need to be updated every 15 minutes as well, and it started to get messy. – pcasa 2012-07-07 21:06:50


Still, every 15 minutes is probably better than on every dashboard load. Whatever you do, I would try to keep the aggregation on the database side. – 2012-07-07 21:54:20
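One way to keep such a rollup cheap to refresh every 15 minutes is to update it incrementally instead of rebuilding it. A sketch, assuming a hypothetical summary table city_fill_up_counts(city, fill_ups) keyed by city and fed only the rows written since the last run (a time-bucket column could be added to support the 24-hour and weekly windows):

# hypothetical summary table: city_fill_up_counts(city VARCHAR PRIMARY KEY, fill_ups INT)
ActiveRecord::Base.connection.execute(<<-SQL)
  INSERT INTO city_fill_up_counts (city, fill_ups)
  SELECT uscensuses.city, COUNT(*)
  FROM fill_ups
  JOIN gas_stations ON gas_stations.id = fill_ups.gas_station_id
  JOIN uscensuses ON uscensuses.zipcode = gas_stations.zipcode
  WHERE fill_ups.created_at >= NOW() - INTERVAL 15 MINUTE
  GROUP BY uscensuses.city
  ON DUPLICATE KEY UPDATE fill_ups = fill_ups + VALUES(fill_ups)
SQL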