Slow performance of Impala queries that combine GROUP BY and LIKE

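We are testing Apache Impala and have noticed that queries combining GROUP BY and LIKE run very slowly, while a query that uses LIKE without GROUP BY runs much faster. Here are two examples (run times are shown above each query):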

# 1.37s 1.08s 1.35s 

SELECT * FROM hive.default.pcopy1B where 
    (lower("by") like '%part%' and lower("by") like '%and%' and lower("by") like '%the%') 
    or (lower(title) like '%part%' and lower(title) like '%and%' and lower(title) like '%the%') 
    or (lower(url) like '%part%' and lower(url) like '%and%' and lower(url) like '%the%') 
    or (lower(text) like '%part%' and lower(text) like '%and%' and lower(text) like '%the%') 
limit 100;

# 156.64s 155.63s 

select "by", type, ranking, count(*) from pcopy where 
    (lower("by") like '%part%' and lower("by") like '%and%' and lower("by") like '%the%') 
    or (lower(title) like '%part%' and lower(title) like '%and%' and lower(title) like '%the%') 
    or (lower(url) like '%part%' and lower(url) like '%and%' and lower(url) like '%the%') 
    or (lower(text) like '%part%' and lower(text) like '%and%' and lower(text) like '%the%') 
group by "by", type, ranking 
order by 4 desc limit 10;

Could someone please explain why this happens, and whether there is any workaround?

Source

2017-02-10 David542

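These two queries look very different to me. The first one just selects records and only needs a cursor; the second has to retrieve every matching record and then run both a GROUP and a SORT. If a very large number of records match, that could explain the difference in run time. Or am I missing something? – LSerni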

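There is a fundamental difference between the two queries.

First query

Key points:

Only 100 rows are selected.
As soon as the process has found 100 rows that satisfy the WHERE clause, it is marked as complete and returns those 100 records.
There will be only 1 mapper step. The number of mappers will depend on the size of your data.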
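
The early exit described in the points above can be illustrated with a small Python model. This is only a sketch of the idea, not of how Impala is implemented: the `matches` and `scan_with_limit` helpers and the sample rows are made up for illustration, and only the three keywords and the column names come from the question.

```python
from itertools import islice

KEYWORDS = ("part", "and", "the")
TEXT_COLUMNS = ("by", "title", "url", "text")


def matches(row):
    """Mimic the WHERE clause: some text column contains all three keywords."""
    return any(
        all(k in (row.get(col) or "").lower() for k in KEYWORDS)
        for col in TEXT_COLUMNS
    )


def scan_with_limit(rows, limit):
    """Return up to `limit` matching rows plus the number of rows actually read."""
    scanned = 0

    def counted():
        nonlocal scanned
        for row in rows:
            scanned += 1
            yield row

    # islice stops pulling from the scan as soon as `limit` matches are found.
    result = list(islice((r for r in counted() if matches(r)), limit))
    return result, scanned


# Hypothetical data: every 10th row matches the filter.
rows = [
    {"by": "user%d" % i,
     "title": "Part one and the rest" if i % 10 == 0 else "nothing"}
    for i in range(100_000)
]

result, scanned = scan_with_limit(rows, 100)
print(len(result), "rows returned after reading", scanned, "of", len(rows))
```

In this toy data set the scan stops long before the end of the table, which is why a plain `LIMIT 100` query can finish quickly when matching rows are common.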

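Second query

Key points:

Only 10 rows are selected.
Although only 10 rows are selected, the process has to scan the full data set in order to produce the result for the GROUP BY clause.
There should be 3 mapper-reducer steps. The number of mappers and reducers in each step will depend on the size of the data.
- The 1st MR reads the data and applies the WHERE clause.
- The 2nd MR handles the GROUP BY clause.
- The 3rd MR handles the ORDER BY clause.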
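
The three stages listed above can be modelled logically in a few lines of Python. This is a sketch under assumptions, independent of how the engine physically executes the plan: `matches` and `group_and_rank` are hypothetical helpers and the rows are invented sample data. The point it shows is that the top 10 groups cannot be known until every row has been read.

```python
from collections import Counter

KEYWORDS = ("part", "and", "the")
TEXT_COLUMNS = ("by", "title", "url", "text")


def matches(row):
    """Mimic the WHERE clause: some text column contains all three keywords."""
    return any(
        all(k in (row.get(col) or "").lower() for k in KEYWORDS)
        for col in TEXT_COLUMNS
    )


def group_and_rank(rows, top_n):
    """Filter, count per (by, type, ranking) group, then keep the top_n groups."""
    scanned = 0
    counts = Counter()
    for row in rows:
        scanned += 1                       # stage 1: read the row, apply WHERE
        if matches(row):
            key = (row["by"], row["type"], row["ranking"])
            counts[key] += 1               # stage 2: GROUP BY ... count(*)
    top = counts.most_common(top_n)        # stage 3: ORDER BY count DESC LIMIT n
    return top, scanned


# Hypothetical data: every 10th row matches the filter.
rows = [
    {"by": "user%d" % (i % 50), "type": "story", "ranking": i % 3,
     "title": "Part one and the rest" if i % 10 == 0 else "nothing",
     "url": None, "text": None}
    for i in range(100_000)
]

top, scanned = group_and_rank(rows, 10)
print(len(top), "groups returned after reading", scanned, "of", len(rows))
```

Unlike a plain `LIMIT`, the limit here applies only after aggregation and sorting, so it does not reduce the number of rows that have to pass through the expensive `lower(...) like` checks.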

So although the queries you provided may look similar, they are completely different and serve different purposes.

I hope this helps.

Source

2017-02-15 06:44:01 Ambrish
