2015-10-07 20 views
2

Impala - Error when getting multiple distinct values

I am using Cloudera CDH-5.4.4 and have a CSV file at an HDFS location. My requirement is to run real-time SQL queries on the Hadoop environment (OLTP).

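So I decided to use Impala: I created a metastore table over the CSV file and then ran queries in the Impala editor (in the Hue application).

When I run the query below, I get the following error:

"AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as count(DISTINCT City); deviating function: count(DISTINCT Country)".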

CSV File 

OrderID,CustomerID,City,Country 
Ord01,Cust01,Aachen,Germany 
Ord02,Cust01,Albuquerque,USA 
Ord03,Cust01,Aachen,Germany 
Ord04,Cust02,Arhus,Denmark 
Ord05,Cust02,Arhus,Denmark 

Problematic Query

Select CustomerID,Count(Distinct City),Count(Distinct Country) From CustomerOrders Group by CustomerID 

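Problem:

I cannot run an Impala query that contains more than one DISTINCT count. I searched the internet, and NDV() is suggested as a workaround, but NDV() returns only an approximate count of distinct values. I need exact distinct counts for multiple fields.

Expectation:

What is the best way to get exact distinct counts for more than one field? Please modify the query above so that it works in Impala.

Note: this is not my original table; I replicated it for this forum question.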
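
For reference, here is a minimal sketch of the exact counts any rewritten query should return for the sample CSV above. It is plain Python rather than Impala SQL, and the names `SAMPLE_CSV` and `exact_distinct_counts` are hypothetical helpers introduced only for this illustration.

```python
import csv
import io
from collections import defaultdict

# The sample data from the question, copied verbatim.
SAMPLE_CSV = """OrderID,CustomerID,City,Country
Ord01,Cust01,Aachen,Germany
Ord02,Cust01,Albuquerque,USA
Ord03,Cust01,Aachen,Germany
Ord04,Cust02,Arhus,Denmark
Ord05,Cust02,Arhus,Denmark
"""


def exact_distinct_counts(csv_text):
    """Return {CustomerID: (distinct cities, distinct countries)}."""
    cities = defaultdict(set)
    countries = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Sets keep each value once, so their sizes are exact distinct counts.
        cities[row["CustomerID"]].add(row["City"])
        countries[row["CustomerID"]].add(row["Country"])
    return {
        cust: (len(cities[cust]), len(countries[cust]))
        for cust in sorted(cities)
    }


print(exact_distinct_counts(SAMPLE_CSV))
```

Any Impala rewrite can be compared against these per-customer pairs.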

Answers

1

I ran into the same problem in Impala. Here is my workaround:

SELECT CustomerID 
    ,sum(nr_of_cities) 
    ,sum(nr_of_countries) 
FROM (
    SELECT CustomerID 
     ,Count(DISTINCT City) AS nr_of_cities 
     ,0 AS nr_of_countries 
    FROM CustomerOrders 
    GROUP BY CustomerID 

    UNION ALL 

    SELECT CustomerID 
     ,0 AS nr_of_cities 
     ,Count(DISTINCT Country) AS nr_of_countries 
    FROM CustomerOrders 
    GROUP BY CustomerID 
) AS aa 
GROUP BY CustomerID 
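
A rough way to sanity-check the UNION ALL pattern on the sample data from the question is to run it through Python's built-in `sqlite3` module. SQLite is only a stand-in here (an assumption: it checks the SQL logic, not Impala's behaviour), and the column aliases on the outer `sum(...)` calls as well as the helper name `run_union_all_workaround` are additions for this sketch.

```python
import sqlite3

# Sample rows from the question's CSV file.
ROWS = [
    ("Ord01", "Cust01", "Aachen", "Germany"),
    ("Ord02", "Cust01", "Albuquerque", "USA"),
    ("Ord03", "Cust01", "Aachen", "Germany"),
    ("Ord04", "Cust02", "Arhus", "Denmark"),
    ("Ord05", "Cust02", "Arhus", "Denmark"),
]

# Each branch has a single DISTINCT aggregate; the other column is padded with 0.
UNION_ALL_QUERY = """
SELECT CustomerID
    ,sum(nr_of_cities) AS nr_of_cities
    ,sum(nr_of_countries) AS nr_of_countries
FROM (
    SELECT CustomerID
        ,Count(DISTINCT City) AS nr_of_cities
        ,0 AS nr_of_countries
    FROM CustomerOrders
    GROUP BY CustomerID

    UNION ALL

    SELECT CustomerID
        ,0 AS nr_of_cities
        ,Count(DISTINCT Country) AS nr_of_countries
    FROM CustomerOrders
    GROUP BY CustomerID
) AS aa
GROUP BY CustomerID
ORDER BY CustomerID
"""


def run_union_all_workaround():
    """Load the sample rows into an in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE CustomerOrders "
        "(OrderID TEXT, CustomerID TEXT, City TEXT, Country TEXT)"
    )
    conn.executemany("INSERT INTO CustomerOrders VALUES (?, ?, ?, ?)", ROWS)
    result = conn.execute(UNION_ALL_QUERY).fetchall()
    conn.close()
    return result


print(run_union_all_workaround())
```

Because each branch contributes 0 for the column it does not compute, the outer `sum` simply picks up the one real count per customer.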
0

I think this can be done more cleanly (untested):

WITH 
countries AS 
(
SELECT CustomerID 
     ,COUNT(DISTINCT Country) AS nr_of_countries 
FROM CustomerOrders 
GROUP BY 1 
) 
, 
cities AS 
(
SELECT CustomerID 
     ,COUNT(DISTINCT City) AS nr_of_cities 
FROM CustomerOrders 
GROUP BY 1 
) 
SELECT cities.CustomerID 
     ,nr_of_cities 
     ,nr_of_countries 
FROM cities INNER JOIN countries USING (CustomerID)
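
The join-based idea can be checked the same way as a sketch, with the `countries` CTE counting `Country` and the selected `CustomerID` qualified by `cities`. SQLite (via Python's `sqlite3`) is used as a stand-in and is an assumption, not Impala; since SQLite accepts several DISTINCT aggregates in one query, the original query serves as the reference result. `run_query` is a hypothetical helper.

```python
import sqlite3

# Sample rows from the question's CSV file.
ROWS = [
    ("Ord01", "Cust01", "Aachen", "Germany"),
    ("Ord02", "Cust01", "Albuquerque", "USA"),
    ("Ord03", "Cust01", "Aachen", "Germany"),
    ("Ord04", "Cust02", "Arhus", "Denmark"),
    ("Ord05", "Cust02", "Arhus", "Denmark"),
]

# One CTE per DISTINCT aggregate, joined back together on CustomerID.
JOIN_QUERY = """
WITH
countries AS (
    SELECT CustomerID, COUNT(DISTINCT Country) AS nr_of_countries
    FROM CustomerOrders
    GROUP BY CustomerID
),
cities AS (
    SELECT CustomerID, COUNT(DISTINCT City) AS nr_of_cities
    FROM CustomerOrders
    GROUP BY CustomerID
)
SELECT cities.CustomerID, nr_of_cities, nr_of_countries
FROM cities INNER JOIN countries USING (CustomerID)
ORDER BY cities.CustomerID
"""

# The question's original query; SQLite runs it, so it acts as the reference.
DIRECT_QUERY = """
SELECT CustomerID, COUNT(DISTINCT City), COUNT(DISTINCT Country)
FROM CustomerOrders
GROUP BY CustomerID
ORDER BY CustomerID
"""


def run_query(sql):
    """Run sql against a fresh in-memory copy of the sample table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE CustomerOrders "
        "(OrderID TEXT, CustomerID TEXT, City TEXT, Country TEXT)"
    )
    conn.executemany("INSERT INTO CustomerOrders VALUES (?, ?, ?, ?)", ROWS)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


print(run_query(JOIN_QUERY) == run_query(DIRECT_QUERY))
```

One design point to keep in mind: an inner join on `CustomerID` drops any group whose `CustomerID` is NULL, because NULL never compares equal to NULL, whereas the UNION ALL variant keeps such a group.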