2012-07-02

How does Hive implement count(distinct ...)?

In GenericUDAFCount.java:

@Description(name = "count",
    value = "_FUNC_(*) - Returns the total number of retrieved rows, including "
        + "rows containing NULL values.\n"
        + "_FUNC_(expr) - Returns the number of rows for which the supplied "
        + "expression is non-NULL.\n"
        + "_FUNC_(DISTINCT expr[, expr...]) - Returns the number of rows for "
        + "which the supplied expression(s) are unique and non-NULL.")

But I don't see any code that handles the "distinct" expression.

public static class GenericUDAFCountEvaluator extends GenericUDAFEvaluator {
  private boolean countAllColumns = false;
  private LongObjectInspector partialCountAggOI;
  private LongWritable result;

  @Override
  public ObjectInspector init(Mode m, ObjectInspector[] parameters)
      throws HiveException {
    super.init(m, parameters);
    partialCountAggOI =
        PrimitiveObjectInspectorFactory.writableLongObjectInspector;
    result = new LongWritable(0);
    return PrimitiveObjectInspectorFactory.writableLongObjectInspector;
  }

  private GenericUDAFCountEvaluator setCountAllColumns(boolean countAllCols) {
    countAllColumns = countAllCols;
    return this;
  }

  /** class for storing count value. */
  static class CountAgg implements AggregationBuffer {
    long value;
  }

  @Override
  public AggregationBuffer getNewAggregationBuffer() throws HiveException {
    CountAgg buffer = new CountAgg();
    reset(buffer);
    return buffer;
  }

  @Override
  public void reset(AggregationBuffer agg) throws HiveException {
    ((CountAgg) agg).value = 0;
  }

  @Override
  public void iterate(AggregationBuffer agg, Object[] parameters)
      throws HiveException {
    // parameters == null means the input table/split is empty
    if (parameters == null) {
      return;
    }
    if (countAllColumns) {
      assert parameters.length == 0;
      ((CountAgg) agg).value++;
    } else {
      assert parameters.length > 0;
      boolean countThisRow = true;
      for (Object nextParam : parameters) {
        if (nextParam == null) {
          countThisRow = false;
          break;
        }
      }
      if (countThisRow) {
        ((CountAgg) agg).value++;
      }
    }
  }

  @Override
  public void merge(AggregationBuffer agg, Object partial)
      throws HiveException {
    if (partial != null) {
      long p = partialCountAggOI.get(partial);
      ((CountAgg) agg).value += p;
    }
  }

  @Override
  public Object terminate(AggregationBuffer agg) throws HiveException {
    result.set(((CountAgg) agg).value);
    return result;
  }

  @Override
  public Object terminatePartial(AggregationBuffer agg) throws HiveException {
    return terminate(agg);
  }

}

How does Hive implement count(distinct ...)? When the job runs, it takes a very long time. Where is the source code?

Answer


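Just as you can run SELECT DISTINCT column1 FROM table1, the DISTINCT expression is not a flag or option of the count function; it is evaluated independently.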

This page says:

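The actual filtering of data bound to parameter types for DISTINCT implementation is handled by the framework and not the COUNT UDAF implementation.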
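To make that division of labor concrete, here is a minimal sketch. It is not Hive's code: the class and method names (DistinctCountSketch, CountAggregator, countDistinct) are hypothetical, and the in-memory HashSet is only an assumed stand-in for whatever the framework does to remove duplicates. It shows a counter that knows nothing about DISTINCT, much like the iterate() method in the question, still producing a distinct count because the caller forwards each unique parameter tuple only once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DistinctCountSketch {

    /** Hypothetical stand-in for the COUNT evaluator: counts rows that contain no NULL parameter. */
    static final class CountAggregator {
        private long value = 0;

        void iterate(Object[] parameters) {
            for (Object p : parameters) {
                if (p == null) {
                    return; // a row with any NULL parameter is not counted
                }
            }
            value++;
        }

        long terminate() {
            return value;
        }
    }

    /** Hypothetical "framework" step: forwards each distinct parameter tuple to the aggregator once. */
    static long countDistinct(List<Object[]> rows) {
        Set<List<Object>> seen = new HashSet<>();
        CountAggregator agg = new CountAggregator();
        for (Object[] row : rows) {
            // Arrays.asList provides value-based equals/hashCode and tolerates null elements.
            if (seen.add(Arrays.asList(row))) {
                agg.iterate(row);
            }
        }
        return agg.terminate();
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] {"a", 1});
        rows.add(new Object[] {"a", 1});  // duplicate, dropped before it reaches the aggregator
        rows.add(new Object[] {"b", 2});
        rows.add(new Object[] {null, 3}); // contains NULL, ignored by the aggregator itself
        System.out.println(countDistinct(rows));
    }
}
```

In this sketch the duplicate filtering and the NULL check live in different places, which is why no "distinct" logic appears in the evaluator. Hive's real mechanism sits in its query planning and execution code and does not use a single in-memory set like this.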

If you want to dig into the details of the source code, take a look at the Hive git repository.