2016-07-17
0

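Splitting a column into groups using Pig Latin

I have a table with two columns (code:chararray, sp:double).

I want to split the second field, sp, into different groups based on conditions such as (< 25), (> 25 and < 45), (>= 45).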

INPUT

code sp 
t001 60.0 
t001 75.0 
a003 34.0 
t001 60.0 
a003 23.0 
a003 23.0 
t001 45.0 
t001 10.0 
t001 8.0 
a003 20.0 
t001 38.0 
a003 55.0 
a003 50.0 
t001 08.0 
a003 44.0 

Expected output:

code  bin1   bin2       bin3
      (<25)  (>25 <45)  (>=45)
t001  3      1          4
a003  3      2          2

I tried a script like the one below:

data = LOAD 'Sandy/rd.csv' using PigStorage(',') As (code:chararray,sp:double); 

data2 = DISTINCT data; 

selfiltnew = FOREACH data2 generate code, sp; 
group_new = GROUP selfiltnew by (code,sp); 

newselt = FOREACH group_new GENERATE selfiltnew.code AS code,selfiltnew.sp AS sp; 

bin1 = filter newselt by sp < 25.0; 
grp1 = FOREACH bin1 GENERATE newselt.code AS code, COUNT(newselt.sp) AS (sp1:double); 

bin2 = filter newselt by sp < 45 and sp >= 25; 
grp2 = FOREACH bin3 GENERATE newselt.code AS code, COUNT(newselt.sp) AS (sp2:double); 

bin3 = filter newselt by sp >=75; 
grp3 = FOREACH bin3 GENERATE newselt.code AS code, COUNT(newselt.sp) AS (sp3:double); 

newbin = JOIN grp1 by code,grp2 by code,grp3 by code; 

newtable = FOREACH newbin GENERATE grp1::group.code AS code, SUM(sp1) AS bin1,SUM(sp2) AS bin2,SUM(sp3) AS bin3; 

data2 = FOREACH newtable GENERATE code, bin1, bin2, bin3; 
dump newtable; 

How can I get the correct output using Pig Latin?

Please specify what is wrong with your script and what you get instead of the expected result. – YakovL

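@YakovL - The error is in grp1 = FOREACH bin1 GENERATE newselt.code AS code, COUNT(newselt.sp) AS (sp1:double); Here I want to count all the sp values below 25. I get the following error: Could not infer the matching function for org.apache.pig.builtin.COUNT as multiple or none of them fit. Please use an explicit cast. – sandy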

I am not sure whether this logic is any good. Is there a better way to split the values into bins? – sandy

Answers

0

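Judging by the desired output, no DISTINCT is needed, and some of the other steps you perform are unnecessary as well. Note that if the input is separated by spaces, you should use PigStorage(' ') instead of PigStorage(','). Following what @inquisitive_mind pointed out, the code is as follows: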

data = LOAD 'Sandy/rd.csv' using PigStorage(' ') As (code:chararray,sp:double); 
bin1 = filter data by sp < 25.0; 
grouped1 = GROUP bin1 by code; 
grp1 = FOREACH grouped1 GENERATE group AS code, COUNT(bin1.sp) AS (sp1:double); 
bin2 = filter data by (sp >= 25.0 AND sp<45); 
grouped2 = GROUP bin2 by code; 
grp2 = FOREACH grouped2 GENERATE group AS code, COUNT(bin2.sp) AS (sp2:double); 
bin3 = filter data by sp >= 45.0; 
grouped3 = GROUP bin3 by code; 
grp3 = FOREACH grouped3 GENERATE group AS code, COUNT(bin3.sp) AS (sp3:double); 
result= JOIN grp1 BY code, grp2 by code, grp3 by code; 
final_result = FOREACH result GENERATE grp1::code as code, grp1::sp1 as bin1, grp2::sp2 as bin2, grp3::sp3 as bin3; 

Here is the output:

[image]

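Thanks for your reply. What about the inner join? If any of grp1, grp2 or grp3 has zero items, the final output is empty. Which join should I use? In my actual dataset I have around 6 bins, and any of the groups may have zero items. – sandy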
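
One way to sidestep the join question raised in this comment is to group once by code and count each bin inside a nested FOREACH. This is only a sketch, not the answerer's method. It assumes the same space-delimited file as the answer above, and the aliases by_code, b1, b2, b3 and binned are made up for illustration. Since every code yields exactly one output row, an empty bin should show up as a count of 0 instead of dropping the row, and more bins can be added as extra FILTER lines.

```pig
-- Assumption: same file and schema as in the answer above, space-delimited.
data = LOAD 'Sandy/rd.csv' USING PigStorage(' ') AS (code:chararray, sp:double);

-- One group per code; each group carries a bag named "data".
by_code = GROUP data BY code;

-- Filter the bag once per bin and count what is left.
-- b1, b2, b3 are illustrative aliases; the boundaries mirror the answer above.
binned = FOREACH by_code {
    b1 = FILTER data BY sp < 25.0;
    b2 = FILTER data BY sp >= 25.0 AND sp < 45.0;
    b3 = FILTER data BY sp >= 45.0;
    GENERATE group AS code, COUNT(b1) AS bin1, COUNT(b2) AS bin2, COUNT(b3) AS bin3;
};

DUMP binned;
```

Counting the sample input by hand, this should give 3, 1, 4 for t001 and 3, 2, 2 for a003. If you prefer to keep the three separate relations, note that Pig supports outer joins only between two relations at a time, so six bins would need a chain of two-way outer joins.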

0

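You have to use GROUP BY before using COUNT.

Source: COUNT

Usage
Use the COUNT function to compute the number of elements in a bag. COUNT requires a preceding GROUP ALL statement for global counts and a GROUP BY statement for group counts.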

bin1 = filter newselt by sp < 25.0; 
-- after GROUP, the bag inside each group is named after the grouped relation (bin1)
grouped1 = GROUP bin1 by code; 
grp1 = FOREACH grouped1 GENERATE group AS code, COUNT(bin1.sp) AS (sp1:double); 
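
To make the two cases from the documentation excerpt concrete, here is a minimal sketch of a global count and a per-group count. It assumes the space-delimited input file from the question, and the aliases all_rows, total, by_code and per_code are hypothetical names chosen for this example.

```pig
-- Assumption: the question's file, space-delimited, two columns.
data = LOAD 'Sandy/rd.csv' USING PigStorage(' ') AS (code:chararray, sp:double);

-- Global count: GROUP ALL puts every row into a single bag named "data".
all_rows = GROUP data ALL;
total = FOREACH all_rows GENERATE COUNT(data) AS n;

-- Per-group count: GROUP BY produces one bag named "data" per distinct code.
by_code = GROUP data BY code;
per_code = FOREACH by_code GENERATE group AS code, COUNT(data) AS n;

DUMP total;
DUMP per_code;
```

In both cases COUNT receives the bag produced by the GROUP statement, which is why calling it on a relation that has not been grouped fails to resolve.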

Thanks for your reply. – sandy