2015-10-22 43 views

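How to join header rows to detail rows across multiple files with Apache Pig

I have several CSV files in an HDFS folder, which I load into a relation with: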

Source = LOAD '$data' USING PigStorage(','); -- $data is passed as a parameter to the pig command.

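When I DUMP it, the Source relation has the following structure (note that the data is text-qualified, but I will handle that with the REPLACE function):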

("HEADER","20110118","20101218","20110118","T00002") 
("0000000000000000035412","20110107","2699","D","20110107","2315.","","","","","","C") 
("0000000000000000035412","20110107","2699","D","20110107","246..","162","74","","","","B") 

<.... more records ....> 

("HEADER","20110224","20110109","20110224","T00002") 
("0000000000000000035412","20110121","2028","D","20110121","a6c3.","","","","","R","P") 
("0000000000000000035412","20110217","2619","D","20110217","a6c3.","","","","","R","P") 

<.... more records ....> 

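So each file has a header row that gives some information about the data set that follows it, such as the provider of the data and the date range it covers.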

Now, how can I transform the structure above into a new relation like the following:

{ 
(HEADER,20110118,20101218,20110118,T00002),{(0000000000000000035412,20110107,2699,D,20110107,2315.,,,,,,C),(0000000000000000035412,20110107,2699,D,20110107,246..,162,74,,,,B),..more tuples..}, 
(HEADER,20110224,20110109,20110224,T00002),{(0000000000000000035412,20110121,2028,D,20110121,a6c3.,,,,,R,P),(0000000000000000035412,20110217,2619,D,20110217,a6c3.,,,,,R,P),..more tuples..},..more tuples.. 
} 

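where each header tuple is followed by a bag of the record tuples that belong to that header? Unfortunately, there is no common key field between the header and detail rows, so I don't think I can use any JOIN operation.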

I am quite new to Pig and Hadoop, and this is one of the first proof-of-concept projects I am working on.

I hope my question is clear, and I look forward to some guidance here.
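For the text qualifiers mentioned earlier, a minimal sketch of the REPLACE approach might look like the following. It only cleans the first two fields, and the aliases Unquoted, f0 and f1 are hypothetical names chosen for illustration; the remaining fields would need the same treatment.

```pig
Source = LOAD '$data' USING PigStorage(',');
-- Fields loaded without a schema are bytearrays, so cast to chararray first.
-- REPLACE takes a regular expression; here it simply matches the double quote.
Unquoted = FOREACH Source GENERATE
    REPLACE((chararray)$0, '"', '') AS f0,
    REPLACE((chararray)$1, '"', '') AS f1;
```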

Answer


This should get you started.
Code:

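-- -tagFile prepends the source file name as $0, so the first CSV field becomes $1.
Source = LOAD '$data' USING PigStorage(',', '-tagFile');
-- SPLIT is a standalone statement; it is not assigned to an alias.
-- The fields are still quote-qualified here; compare against 'HEADER' if the quotes were stripped first.
SPLIT Source INTO FileHeaders IF $1 == '"HEADER"', FileData OTHERWISE;
B = GROUP FileData BY $0;
C = GROUP FileHeaders BY $0;
-- The key produced by GROUP is named "group" (lowercase).
D = JOIN B BY group, C BY group;
...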
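One possible way to carry the idea through to the structure asked for is sketched below. It swaps the two GROUPs and the JOIN for a single COGROUP on the file name, which is a substitution and not necessarily what the answer intended. It assumes exactly one header row per file and the quote-qualified layout shown in the question; the aliases ByFile and Result are hypothetical.

```pig
-- $0 is the file name added by -tagFile; $1 is the first CSV field.
Source = LOAD '$data' USING PigStorage(',', '-tagFile');
SPLIT Source INTO FileHeaders IF $1 == '"HEADER"', FileData OTHERWISE;
-- COGROUP on the file name pairs each file's header with its detail rows.
ByFile = COGROUP FileHeaders BY $0, FileData BY $0;
-- With one header per file (an assumption), FLATTEN yields one output row per file:
-- the header fields followed by a bag of that file's detail tuples.
Result = FOREACH ByFile GENERATE FLATTEN(FileHeaders), FileData;
DUMP Result;
```

Note that the file name remains as the first field of both the flattened header and each detail tuple; a further FOREACH could project it away if it is not wanted.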

Yes, this definitely got me started, thank you! – rarpal