2014-08-28 93 views
0

Extracting from a line in Pig

I want to group my data by url. Each record is currently stored as one long line, for example: {"mobile", "country:US", "url:1234.com", "newuser:Y"} and so on.

This is what I have so far:

RAW = LOAD '/data/events/raw/2014-08-21/' as (line:chararray); 
A = FILTER RAW BY (INDEXOF(line,'mobile') != -1);
B = LIMIT A 800; 
URL = GROUP B BY (INDEXOF(line, 'url')); 
STORE URL INTO '/user/hadoopuser/RS_traffic.txt'; 

How do I extract the url from the string so that I can group by it? Can I use a regular expression?

+1

Your input looks like JSON; you could try loading it with JsonLoader (see JsonLoader/JsonStorage: http://pig.apache.org/docs/r0.10.0/func.html#jsonloadstore) – 2014-08-29 07:09:28

+0

This is not valid JSON – 2014-09-10 07:53:15

Answer

0

You can use the REGEX_EXTRACT() function:

REGEX_EXTRACT Javadoc

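RAW = LOAD '/data/events/*' AS (line:chararray);
-- keep only the lines that mention 'mobile'
A = FILTER RAW BY (INDEXOF(line,'mobile') != -1);
-- extract the url from each remaining line; capture group 1 of the pattern is returned
C = FOREACH A GENERATE REGEX_EXTRACT(line, '<your_pattern>', 1) AS url:chararray;
URL = GROUP C BY url;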
.... 
STORE URL INTO '/user/hadoopuser/RS_traffic.txt';
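
For illustration, here is one way the pattern placeholder might be filled in. This is only a sketch: it assumes every line carries a token of the form "url:1234.com" as in the example from the question, the regular expression is a guess at that format, and the relation names D and COUNTS and the output path are hypothetical. Adjust the pattern if your values can themselves contain quotes, commas, or braces, or if another key ends in "url:".

```pig
-- Sketch only: the pattern and the output path are assumptions, not from the original post.
RAW = LOAD '/data/events/raw/2014-08-21/' AS (line:chararray);
A = FILTER RAW BY (INDEXOF(line, 'mobile') != -1);

-- Group 1 captures everything after 'url:' up to the next double quote,
-- comma, or closing brace. The surrounding .* makes the pattern span the
-- whole line.
C = FOREACH A GENERATE REGEX_EXTRACT(line, '.*url:([^",}]+).*', 1) AS url:chararray;

-- REGEX_EXTRACT yields null when the pattern does not match; drop those rows
-- so they do not end up together in a single null group.
D = FILTER C BY url IS NOT NULL;

URL = GROUP D BY url;

-- One output row per url with the number of matching lines.
COUNTS = FOREACH URL GENERATE group AS url, COUNT(D) AS hits;

STORE COUNTS INTO '/user/hadoopuser/RS_traffic_by_url';
```

Grouping on the extracted field rather than on INDEXOF(line, 'url') matters because INDEXOF returns the position of the match in the string, so the original script grouped lines by where "url" appears, not by which url they contain.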