2016-03-15 50 views
1

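Discarding null values after a full outer join in Pig

I need help discarding null values after a full outer join in Pig Latin. Here are the two data sets: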

A:

(BOS,2) 
(BUR,81) 
(LAS,8) 

B:

(BUR,56) 
(EWR,2) 
(LAS,88) 

After the full outer join, C:

(BOS,2,,) 
(BUR,81,BUR,56) 
(,,EWR,2) 
(LAS,8,LAS,88) 

I need the output in the following format:

(BOS,2) 
(BUR,137) 
(EWR,2) 
(LAS,96) 

I tried different combinations of FLATTEN, BagToTuple and so on, but could not find a solution. Thank you very much for your help.

airline = load '/demo/data/airline/airline.csv' using PigStorage(',') as (Origin: chararray, Dest: chararray); 
traffic_in = GROUP airline by Origin; 
traffic_in_count= FOREACH traffic_in generate group as Origin , COUNT(airline) as count ; 
traffic_out = GROUP airline by Dest; 
traffic_out_count = FOREACH traffic_out generate group as Dest ,COUNT (airline) as count; 
traffic_top = JOIN traffic_in_count by Origin FULL OUTER , traffic_out_count by Dest ; 
+0

Please share your Pig script. It looks like you could use COGROUP and then SUM. Have you tried that? – Mzf
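A minimal sketch of the COGROUP-then-SUM idea from this comment, not a tested solution. It assumes the two data sets from the question are stored as comma-separated files; the file names `a.txt` and `b.txt`, the field names `code` and `n`, and the aliases `grouped` and `totals` are all hypothetical.

```pig
-- Hypothetical input files holding the two data sets shown in the question.
A = LOAD 'a.txt' USING PigStorage(',') AS (code:chararray, n:long);
B = LOAD 'b.txt' USING PigStorage(',') AS (code:chararray, n:long);

-- COGROUP keeps every key from either input; each row carries the key
-- plus one bag per input, and a bag is empty when that side lacks the key.
grouped = COGROUP A BY code, B BY code;

-- Guard each SUM with IsEmpty so a missing side contributes 0 instead of null.
totals = FOREACH grouped GENERATE
    group AS code,
    (IsEmpty(A) ? 0L : SUM(A.n)) + (IsEmpty(B) ? 0L : SUM(B.n)) AS total;

DUMP totals;
```

Because the key comes from `group`, there are no half-empty rows to clean up afterwards, which is the part the full outer join makes awkward.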

+0

airline = load '/demo/data/airline/airline.csv' using PigStorage(',') as (Origin: chararray, Dest: chararray); traffic_in = GROUP airline by Origin; traffic_in_count = FOREACH traffic_in generate group as Origin, COUNT(airline) as count; traffic_out = GROUP airline by Dest; traffic_out_count = FOREACH traffic_out generate group as Dest, COUNT(airline) as count; traffic_top = JOIN traffic_in_count by Origin FULL OUTER, traffic_out_count by Dest; --- Please forgive me, I could not format the code. – Fasahat

+0

The above is my actual code; I replaced the aliases in the question. – Fasahat

Answer

0

EDIT: Instead of using an outer join, use UNION and then SUM the values in the second column.

A = LOAD 'test1.txt' using PigStorage(',') as (A1:chararray, A2:int); 
B = LOAD 'test2.txt' using PigStorage(',') as (B1:chararray, B2:int); 
C = UNION A,B; 
D = GROUP C BY $0; 
E = FOREACH D GENERATE group,SUM(C.$1); 
DUMP E; 

Output

Total
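Applied to the script from the question, the same UNION approach might look like the sketch below. It reuses the path and the `airline`, `traffic_in` and `traffic_out` aliases from the question; the field names `airport` and `cnt` and the aliases `traffic_all`, `traffic_by_airport` and `traffic_total` are made up for illustration, and the script has not been run against the real data.

```pig
-- Sketch only: path and first aliases come from the question,
-- the remaining names are hypothetical.
airline = LOAD '/demo/data/airline/airline.csv' USING PigStorage(',')
          AS (Origin:chararray, Dest:chararray);

-- Count rows per origin and per destination, giving both results the same schema.
traffic_in        = GROUP airline BY Origin;
traffic_in_count  = FOREACH traffic_in GENERATE group AS airport, COUNT(airline) AS cnt;
traffic_out       = GROUP airline BY Dest;
traffic_out_count = FOREACH traffic_out GENERATE group AS airport, COUNT(airline) AS cnt;

-- Stack the two relations instead of joining them, so no null columns appear.
traffic_all = UNION traffic_in_count, traffic_out_count;

-- One row per airport, with the origin and destination counts added together.
traffic_by_airport = GROUP traffic_all BY airport;
traffic_total = FOREACH traffic_by_airport
                GENERATE group AS airport, SUM(traffic_all.cnt) AS total;

DUMP traffic_total;
```

Naming the key and count fields identically in both relations keeps the schema of the union unambiguous, so the later GROUP and SUM can refer to fields by name rather than by position.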

+0

I tried the filter option above, but did not get the same output. – Fasahat

+0

@Fasahat I have updated the answer. Instead of joining, you can UNION the two relations and get the totals. –