
I am running into performance problems when inserting millions of rows into a PostgreSQL database.

I am sending a JSON object that contains an array with millions of rows.

For each row, I create a record in a database table. I have also tried multi-row inserts, but the problem remains.

I am not sure how to handle this; I have read that the COPY command is the fastest option.

How can I improve the performance?
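For reference, a minimal sketch of what a COPY-based load might look like, assuming the JSON is first flattened into a CSV file (the file path below is only a placeholder):

COPY log_rlda (record_node_id, log_line, log_value, timestamp, record_log_id) 
FROM '/tmp/log_rlda.csv'  -- placeholder path to the pre-flattened data 
WITH (FORMAT csv); 

COPY avoids the per-statement overhead of many INSERTs, but it requires the data to be available in flat, tabular form first.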

My JSON object has log as an array; the log array contains millions of rows:

{"type":"monitoring","log":[ 
["2016-10-12T20:33:21","0.00","0.00","0.00","0.00","0.0","24.00","1.83","-0.00","1","1","-100.00"], 
["2016-10-12T20:33:23","0.00","0.00","0.00","0.00","0.0","24.00","1.52","-0.61","1","1","-100.00"]]} 

My current code (I build a dynamic statement so that I can execute multiple rows at once):

IF(NOT b_first_line) THEN 
      s_insert_query_values = right(s_insert_query_values, -1); --remove the leading comma 

      EXECUTE format('INSERT INTO log_rlda 
        (record_node_id, log_line, log_value, timestamp, record_log_id) 
      VALUES %s;', s_insert_query_values); 

      s_insert_query_values = ''; 
      i_num_lines_buffered = 0; 
     END IF; 

Regarding what s_insert_query_values contains: each value inside the arrays in "log" needs to be inserted into its own row (into the column log_value). This is what the INSERT looks like (s_insert_query_values supplies the VALUES list):

INSERT INTO log_rlda 
        (record_node_id, log_line, log_value, timestamp, record_log_id) 
      VALUES 
    (806, 1, 0.00, '2016-10-12 20:33:21', 386), 
    (807, 1, 0.00, '2016-10-12 20:33:21', 386), 
    (808, 1, 0.00, '2016-10-12 20:33:21', 386), 
    (809, 1, 0.00, '2016-10-12 20:33:21', 386), 
    (810, 1, 0.0, '2016-10-12 20:33:21', 386), 
    (811, 1, 24.00, '2016-10-12 20:33:21', 386), 
    (768, 1, 1.83, '2016-10-12 20:33:21', 386), 
    (769, 1, 0.00, '2016-10-12 20:33:21', 386), 
    (728, 1, 1, '2016-10-12 20:33:21', 386), 
    (771, 1, 1, '2016-10-12 20:33:21', 386), 
    (729, 1, -100.00, '2016-10-12 20:33:21', 386), 
    (806, 2, 0.00, '2016-10-12 20:33:23', 386), 
    (807, 2, 0.00, '2016-10-12 20:33:23', 386), 
    (808, 2, 0.00, '2016-10-12 20:33:23', 386), 
    (809, 2, 0.00, '2016-10-12 20:33:23', 386), 
    (810, 2, 0.0, '2016-10-12 20:33:23', 386), 
    (811, 2, 24.00, '2016-10-12 20:33:23', 386), 
    (768, 2, 1.52, '2016-10-12 20:33:23', 386), 
    (769, 2, -0.61, '2016-10-12 20:33:23', 386), 
    (728, 2, 1, '2016-10-12 20:33:23', 386), 
    (771, 2, 1, '2016-10-12 20:33:23', 386), 
    (729, 2, -100.00, '2016-10-12 20:33:23', 386) 

My solution (i_node_id_list contains the IDs that were selected before this query):

SELECT i_node_id_list[log_value_index] AS record_node_id, 
        e.log_line-1 AS log_line, 
        items.log_value::double precision as log_value, 
        to_timestamp((e.line->>0)::text, 'YYYY-MM-DD HH24:MI:SS') as "timestamp", 
        i_log_id as record_log_id 
       FROM (VALUES (log_data::json)) as data (doc), 
       json_array_elements(doc->'log') with ordinality as e(line, log_line), 
       json_array_elements_text(e.line)  with ordinality as items(log_value, log_value_index) 
       WHERE log_value_index > 1 -- don't include the timestamp value (it shouldn't be written as a log_value) 
       AND log_line > 1 
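
For completeness, a sketch of how this SELECT feeds the table in a single statement inside my function (assuming, as above, that the variables log_data, i_node_id_list and i_log_id are in scope):

INSERT INTO log_rlda 
  (record_node_id, log_line, log_value, timestamp, record_log_id) 
SELECT i_node_id_list[log_value_index], 
    e.log_line-1, 
    items.log_value::double precision, 
    to_timestamp((e.line->>0)::text, 'YYYY-MM-DD HH24:MI:SS'), 
    i_log_id 
   FROM (VALUES (log_data::json)) as data (doc), 
   json_array_elements(doc->'log') with ordinality as e(line, log_line), 
   json_array_elements_text(e.line) with ordinality as items(log_value, log_value_index) 
   WHERE log_value_index > 1 -- don't include the timestamp value 
   AND log_line > 1 

This writes the whole log array in one INSERT ... SELECT, so the dynamically built VALUES string is no longer needed.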

Answer

You need two levels of unnesting.

select e.log_line, items.log_value, e.line -> 0 as timestamp 
from (
    values ('{"type":"monitoring","log":[ 
    ["2016-10-12T20:33:21","0.00","0.00","0.00","0.00","0.0","24.00","1.83","-0.00","1","1","-100.00"], 
    ["2016-10-12T20:33:23","0.00","0.00","0.00","0.00","0.0","24.00","1.52","-0.61","1","1","-100.00"]]}'::json) 
) as data (doc), 
    json_array_elements(doc->'log') with ordinality as e(line, log_line), 
    json_array_elements(e.line) with ordinality as items(log_value, log_value_index) 
where log_value_index > 1; 

The first call to json_array_elements() extracts all array elements from the log attribute. with ordinality allows us to identify each row within that array. The second call then extracts each element from those rows, and again with ordinality allows us to determine the position within the array.
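As a small, self-contained illustration of with ordinality on its own (a toy array, not part of the answer's query):

select val, idx 
from json_array_elements_text('["a","b","c"]'::json) with ordinality as t(val, idx); 

 val | idx 
-----+----- 
 a  |  1 
 b  |  2 
 c  |  3 

Each element is returned together with its 1-based position, which is what the answer's query relies on for log_line and log_value_index.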

Applied to the JSON from the question, the answer's two-level query returns this:

 log_line | log_value | timestamp 
----------+-----------+----------------------- 
        1 | "0.00"    | "2016-10-12T20:33:21" 
        1 | "0.00"    | "2016-10-12T20:33:21" 
        1 | "0.00"    | "2016-10-12T20:33:21" 
        1 | "0.00"    | "2016-10-12T20:33:21" 
        1 | "0.0"     | "2016-10-12T20:33:21" 
        1 | "24.00"   | "2016-10-12T20:33:21" 
        1 | "1.83"    | "2016-10-12T20:33:21" 
        1 | "-0.00"   | "2016-10-12T20:33:21" 
        1 | "1"       | "2016-10-12T20:33:21" 
        1 | "1"       | "2016-10-12T20:33:21" 
        1 | "-100.00" | "2016-10-12T20:33:21" 
        2 | "0.00"    | "2016-10-12T20:33:23" 
        2 | "0.00"    | "2016-10-12T20:33:23" 
        2 | "0.00"    | "2016-10-12T20:33:23" 
        2 | "0.00"    | "2016-10-12T20:33:23" 
        2 | "0.0"     | "2016-10-12T20:33:23" 
        2 | "24.00"   | "2016-10-12T20:33:23" 
        2 | "1.52"    | "2016-10-12T20:33:23" 
        2 | "-0.61"   | "2016-10-12T20:33:23" 
        2 | "1"       | "2016-10-12T20:33:23" 
        2 | "1"       | "2016-10-12T20:33:23" 
        2 | "-100.00" | "2016-10-12T20:33:23" 

The result of that statement can then be used to insert the data directly, without looping over it. That should be a lot faster than doing many single inserts.

But I am not sure how to integrate the correct record_node_id and record_log_id into the above result.


Thank you for the quick answer. The data is unnested in a different way, though: for every value in a "line array" except the first one, a row is inserted into log_rlda. Meaning of the columns in log_rlda: record_node_id refers to the record node that describes what the value means; log_line is the line number within the log data; log_value is the individual value; timestamp is the first value of the "line array"; record_log_id refers to the data set as a whole. I have also answered your first question about what "s_insert_query_values" contains. – Vern


@Vern: see my edit –


To solve the problem of integrating the correct record_node_id, I edited the statement to select the correct ID from the array of IDs that I had collected beforehand, using log_value_index to index into it. It works very well, and so far I have seen a significant performance improvement. I have added the solution to the question. – Vern