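How to count duplicate values in Pig
I am trying to count the number of member IDs in a file containing a list of IDs that are duplicated, i.e. that have a count > 1. I ran the following, but got a single value, which I think is just the number of rows in the memberid column: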
ids = load 'ids';
ids = filter ids by id;
group = group ids ALL;
count = foreach group generate count (ids);
dump count;
I'm assuming the file is tab-delimited.
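-- Load the tab-delimited file; single-argument ToDate expects create_dt to be an ISO 8601 string
A = LOAD '/test.txt' USING PigStorage('\t') AS (id:int, create_dt:chararray);
-- Keep rows with id > 1 that were created within the last 30 days
B = FILTER A BY (id > 1 AND DaysBetween(CurrentTime(), ToDate(create_dt)) <= 30);
C = GROUP B BY id;
D = FOREACH C GENERATE group AS id, COUNT(B) AS totalcount;
DUMP D;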
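For the original question (how many ids occur more than once), a minimal sketch could group by id, count each group, keep the groups with a count above 1, and then count what is left. This assumes the file 'ids' holds one id per line; the aliases (by_id, counts, dups, all_dups, result) are only illustrative. Note that `group` is a reserved word in Pig, so it should not be used as an alias, and built-in function names such as COUNT are case-sensitive.

```pig
-- Assumption: 'ids' contains a single column with one member id per line
ids      = LOAD 'ids' AS (id:chararray);
-- One group per distinct id
by_id    = GROUP ids BY id;
-- Number of rows for each id
counts   = FOREACH by_id GENERATE group AS id, COUNT(ids) AS n;
-- Keep only the ids that appear more than once
dups     = FILTER counts BY n > 1;
-- Collapse everything into one group to count the duplicated ids
all_dups = GROUP dups ALL;
result   = FOREACH all_dups GENERATE COUNT(dups) AS duplicated_ids;
DUMP result;
```

If no id is duplicated, dups is empty and the final DUMP prints nothing rather than 0.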
Actually, my file has 2 columns, an id column and a created column. How do I count the ids > 1 whose created date is within 30 days of today? – Tai