Impala data locality

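I have a question about data locality in Impala. Say I have a cluster of 10 data nodes, each running an impalad, and I execute a query in Impala such as SELECT * FROM big_table where dt='2017' where blabla orderby blabla group by blabla (assume it is a big query).

Now say the files in the partition (dt='2017') are on DN 1, 3, and 5. When I execute the query, will the coordinator take advantage of data locality and use only daemons 1, 3, and 5, or will it use all the daemons, with the others reading the data remotely?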


2017-02-09 commando

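Short answer to your question: for data locality, it uses only daemons 1, 3, and 5.

This is fundamentally a scheduling question. Impala makes this kind of decision in simple-scheduler.cc.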

// We schedule greedily in this order: 
// cached collocated replicas > collocated replicas > remote (cached or not) replicas.

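If a backend is collocated with a data node, Impala will not use any other backend to scan the data on that node. Fragments that have no scan node (such as a partitioned aggregation) are placed on the same hosts as their input fragment.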
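The greedy preference order quoted above can be illustrated with a small Python sketch. This is not Impala's actual C++ code: Replica, rank, and schedule_scan_ranges are hypothetical names, and the round-robin fallback for blocks with no collocated replica is an assumption added only to keep the sketch complete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    host: str     # data node holding this copy of the block
    cached: bool  # True if the block is in the HDFS cache on that host


def rank(replica, impalad_hosts):
    """Lower is better: cached collocated < collocated < remote."""
    collocated = replica.host in impalad_hosts
    if collocated and replica.cached:
        return 0
    if collocated:
        return 1
    return 2  # remote replica, cached or not


def schedule_scan_ranges(blocks, impalad_hosts):
    """Map each block id to the host whose impalad should scan it."""
    assignment = {}
    fallback = sorted(impalad_hosts)  # assumption: round-robin for remote reads
    for i, (block_id, replicas) in enumerate(blocks.items()):
        best = min(replicas, key=lambda r: rank(r, impalad_hosts))
        if rank(best, impalad_hosts) < 2:
            assignment[block_id] = best.host  # local read
        else:
            assignment[block_id] = fallback[i % len(fallback)]  # remote read
    return assignment


# The scenario from the question: impalad on dn1..dn10, data only on dn1, dn3, dn5.
impalad_hosts = {f"dn{i}" for i in range(1, 11)}
blocks = {
    "block_a": [Replica("dn1", False), Replica("dn3", False), Replica("dn5", False)],
    "block_b": [Replica("dn1", False), Replica("dn3", True), Replica("dn5", False)],
    "block_c": [Replica("dn5", False), Replica("dn1", False), Replica("dn3", False)],
}
assignment = schedule_scan_ranges(blocks, impalad_hosts)
```

In this sketch every block has a collocated replica, so only dn1, dn3, and dn5 are ever chosen for scanning, even though all ten hosts run an impalad. The real scheduler also balances load across candidate hosts, which is left out here.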

// there is no leftmost scan; we assign the same hosts as those of our
// leftmost input fragment (so that a partitioned aggregation fragment
// runs on the hosts that provide the input data)
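The rule in that comment can be sketched as a short recursive lookup. This is only an illustration under assumed names: the plan dictionary, its "scan_hosts" and "inputs" keys, and fragment_hosts are all hypothetical and do not mirror Impala's real data structures.

```python
def fragment_hosts(name, plan):
    """Return the hosts a fragment runs on.

    A fragment with a scan runs where its scan ranges were scheduled.
    A fragment without a scan inherits the hosts of its leftmost input.
    """
    fragment = plan[name]
    if fragment["scan_hosts"] is not None:
        return set(fragment["scan_hosts"])
    leftmost_input = fragment["inputs"][0]
    return fragment_hosts(leftmost_input, plan)


# Hypothetical two-fragment plan for the query in the question.
plan = {
    "scan_big_table": {"scan_hosts": {"dn1", "dn3", "dn5"}, "inputs": []},
    "partitioned_agg": {"scan_hosts": None, "inputs": ["scan_big_table"]},
}
agg_hosts = fragment_hosts("partitioned_agg", plan)
```

Under these assumptions the aggregation fragment ends up on dn1, dn3, and dn5, the same hosts that scanned the data, so the remaining seven daemons take no part in those fragments.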


2017-03-12 04:19:21 Amos
