2015-02-24

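Generating a count from a substring (Apache Pig)

I want to count records by a specific part of a substring. My A is correct, but I am having trouble getting B to work. I have included the lab's comments to help explain parts of the code.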

data = LOAD '/dualcore/orders' AS (order_id:int, 
     cust_id:int, 
     order_dtm:chararray); 

/* 
    * Include only records where the 'order_dtm' field matches 
    * the regular expression pattern: 
    * 
    * ^  = beginning of string 
    * 2013 = literal value '2013' 
    * 0[2345] = 0 followed by 2, 3, 4, or 5 
    * -  = a literal character '-' 
    * \\d{2} = exactly two digits 
    * \\s  = a single whitespace character 
    * .*  = any number of any characters 
    * $  = end of string 
    * 
    * If you are not familiar with regular expressions and would 
    * like to know more about them, see the Regular Expression 
    * Reference at the end of the Exercise Manual. 
    */ 
recent = FILTER data by order_dtm matches '^2013-0[2345]-\\d{2}\\s.*$'; 

-- TODO (A): Create a new relation with just the order's year and month 
A = FOREACH data GENERATE SUBSTRING(order_dtm,0,7); 

-- TODO (B): Count the number of orders in each month 
B = FOREACH data GENERATE COUNT_STAR(A); 

-- TODO (C): Display the count by month to the screen. 
DUMP C;

Answer


You can solve this in two ways.

Option 1: Using SUBSTRING, as you proposed

Input

1  100  2013-02-15 test 
2  100  2013-04-20 test1 
1  101  2013-02-14 test2 
1  101  2014-02-27 test3 

Pig script using SUBSTRING:

data = LOAD 'input' AS (order_id:int, cust_id:int, order_dtm:chararray);
recent = FILTER data BY order_dtm matches '^2013-0[2345]-\\d{2}\\s.*$';
A = FOREACH recent GENERATE order_id, cust_id, SUBSTRING(order_dtm,0,4) AS year, SUBSTRING(order_dtm,5,7) AS month;
B = GROUP A BY month;
C = FOREACH B GENERATE group AS month, FLATTEN(A.year) AS year, COUNT(A) AS cnt;
DUMP C;

Output:

(02,2013,2) 
(02,2013,2) 
(04,2013,1) 
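
The row (02,2013,2) appears twice in the output above because FLATTEN(A.year) emits one row for every record in the group. If one row per month is preferred, a possible variation (a sketch, not part of the original answer) is to group on year and month together. The aliases parts, grouped, counts, order_year and order_month are names chosen here for illustration only.

```pig
data = LOAD 'input' AS (order_id:int, cust_id:int, order_dtm:chararray);
recent = FILTER data BY order_dtm matches '^2013-0[2345]-\\d{2}\\s.*$';

-- split the timestamp into a year part and a month part
parts = FOREACH recent GENERATE
            SUBSTRING(order_dtm, 0, 4) AS order_year,
            SUBSTRING(order_dtm, 5, 7) AS order_month;

-- grouping on both fields gives exactly one group per (year, month) pair
grouped = GROUP parts BY (order_year, order_month);

-- FLATTEN(group) unpacks the key tuple; COUNT works on the bag of grouped records
counts = FOREACH grouped GENERATE
            FLATTEN(group) AS (order_year, order_month),
            COUNT(parts) AS cnt;

DUMP counts;
```

With the sample input above, this should produce one row for February 2013 with a count of 2 and one row for April 2013 with a count of 1. DUMP does not guarantee the order of the rows.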

Option 2: Using a regular-expression function

data = LOAD 'input' AS (order_id:int, cust_id:int, order_dtm:chararray);
A = FOREACH data GENERATE order_id, cust_id, FLATTEN(REGEX_EXTRACT_ALL(order_dtm,'^(2013)-(0[2345])-\\d{2}\\s.*$')) AS (year,month);
B = FILTER A BY month IS NOT NULL;
C = GROUP B BY month;
D = FOREACH C GENERATE group AS month, FLATTEN(B.year) AS year, COUNT(B) AS cnt;
DUMP D;

Output:

(02,2013,2) 
(02,2013,2) 
(04,2013,1) 

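In both cases I have also included the year in the final output. If you do not want it, remove the FLATTEN(A.year) expression (FLATTEN(B.year) in option 2) from the script.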
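
As for the TODOs in the question: B fails because COUNT and COUNT_STAR expect a bag, which normally comes from a GROUP. A is a separate relation, not a bag field of data, so it cannot be counted inside a FOREACH over data. The following is a minimal sketch closer to the lab's wording, assuming the year and month are kept together as one 'YYYY-MM' string and that A should be built from the filtered relation recent. The alias by_month and the field name year_month are hypothetical names added here.

```pig
data = LOAD '/dualcore/orders' AS (order_id:int,
                                   cust_id:int,
                                   order_dtm:chararray);

recent = FILTER data BY order_dtm matches '^2013-0[2345]-\\d{2}\\s.*$';

-- (A) keep just the year and month, e.g. '2013-02'
A = FOREACH recent GENERATE SUBSTRING(order_dtm, 0, 7) AS year_month;

-- group first, so that each group carries a bag (named A) of its records
by_month = GROUP A BY year_month;

-- (B) count the records in each group's bag
B = FOREACH by_month GENERATE group AS year_month, COUNT_STAR(A) AS cnt;

-- (C) display the count by month
DUMP B;
```

COUNT_STAR counts every tuple in the bag, while COUNT skips tuples whose first field is null. After the filter above the two should give the same result.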