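COLLECT_SET() in Hive (Hadoop)
I just learned about the collect_set() function in Hive, and I started a job on a 3-node development cluster.
I only have about 10 GB to process. However, the job is literally taking forever. I think there may be a bug in the implementation of collect_set(), a bug in my code, or the collect_set() function really is resource intensive.
Here is my SQL for Hive (no pun intended):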
INSERT OVERWRITE TABLE sequence_result_1
SELECT sess.session_key as session_key,
       sess.remote_address as remote_address,
       sess.hit_count as hit_count,
       COLLECT_SET(evt.event_id) as event_set,
       hit.rsp_timestamp as hit_timestamp,
       sess.site_link as site_link
FROM site_session sess
JOIN (SELECT * FROM site_event
      WHERE event_id = 274 OR event_id = 284 OR event_id = 55 OR event_id = 151) evt
  ON (sess.session_key = evt.session_key)
JOIN site_hit hit ON (sess.session_key = evt.session_key)
GROUP BY sess.session_key, sess.remote_address, sess.hit_count, hit.rsp_timestamp, sess.site_link
ORDER BY hit_timestamp;
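There are 4 MR passes. The first took about 30 seconds. The second map took about 1 minute. Most of the second reduce took about 2 minutes. In the last two hours, it has gone from 97.71% to 97.73%. Is this right? I think there must be some problem. I took a look at the log, and I can't tell whether it is normal.
[Sample of the log]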
2011-06-21 16:32:22,715 WARN org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Tbl flush: #hash table = 120894
2011-06-21 16:32:22,758 WARN org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Table flushed: new size = 108804
2011-06-21 16:32:23,003 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 5142000000 rows
2011-06-21 16:32:23,003 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 5142000000 rows
2011-06-21 16:32:24,138 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 5143000000 rows
2011-06-21 16:32:24,138 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 5143000000 rows
2011-06-21 16:32:24,725 WARN org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Tbl flush: #hash table = 120894
2011-06-21 16:32:24,768 INFO org.apache.hadoop.hive.ql.exec.GroupByOperator: 6 forwarding 42000000 rows
2011-06-21 16:32:24,771 WARN org.apache.hadoop.hive.ql.exec.GroupByOperator: Hash Table flushed: new size = 108804
2011-06-21 16:32:25,338 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 5144000000 rows
2011-06-21 16:32:25,338 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 5144000000 rows
2011-06-21 16:32:26,467 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 5145000000 rows
2011-06-21 16:32:26,468 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 5145000000 rows
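One thing that stands out in the log above is that the JoinOperator is forwarding more than 5 billion rows, which seems like far more than about 10 GB of input should produce. A possible explanation, offered as an assumption rather than a confirmed diagnosis, is that the ON clause of the second JOIN compares sess.session_key with evt.session_key and never references hit, so each row of site_hit may be paired with every joined sess/evt row. Below is a minimal sketch of the same query with that condition pointing at hit instead. It assumes site_hit has a session_key column, which the post does not show.

```sql
-- Sketch only: assumes site_hit has a session_key column (its schema is not shown in the post).
INSERT OVERWRITE TABLE sequence_result_1
SELECT sess.session_key as session_key,
       sess.remote_address as remote_address,
       sess.hit_count as hit_count,
       COLLECT_SET(evt.event_id) as event_set,
       hit.rsp_timestamp as hit_timestamp,
       sess.site_link as site_link
FROM site_session sess
JOIN (SELECT * FROM site_event
      WHERE event_id = 274 OR event_id = 284 OR event_id = 55 OR event_id = 151) evt
  ON (sess.session_key = evt.session_key)
-- The only change: the join condition now references hit, so hits are matched per session.
JOIN site_hit hit
  ON (sess.session_key = hit.session_key)
GROUP BY sess.session_key, sess.remote_address, sess.hit_count, hit.rsp_timestamp, sess.site_link
ORDER BY hit_timestamp;
```

If the row counts in the JoinOperator log lines fall to something much closer to the size of site_hit after this change, that would support this reading. If they do not, the cause is probably elsewhere.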
I'm pretty new to this, and trying to work with collect_set() and Hive arrays is driving me off the deep end.
Thanks in advance :)
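I didn't know the rule about putting the fact table on the right side, but that is already the case here. Good to keep in mind. I'll try it and let you know. – batman
Got it running. The initial part is slightly faster, but now it is stuck in the neighborhood of 97.71%. Maybe that is the threshold percentage at which the collect_set() function runs. – batman
Totally different problem. Thanks, my answer is above. – batman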