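How to join header rows to detail rows in multiple files with Apache Pig
I have an HDFS folder containing several CSV files, which I load into a relation with:
source = LOAD '$data' USING PigStorage(','); -- $data is passed as a parameter to the pig command.
When I dump it, the source relation has the following structure (note that the data is text-qualified with double quotes, but I will handle that with the REPLACE function):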
("HEADER","20110118","20101218","20110118","T00002")
("0000000000000000035412","20110107","2699","D","20110107","2315.","","","","","","C")
("0000000000000000035412","20110107","2699","D","20110107","246..","162","74","","","","B")
<.... more records ....>
("HEADER","20110224","20110109","20110224","T00002")
("0000000000000000035412","20110121","2028","D","20110121","a6c3.","","","","","R","P")
("0000000000000000035412","20110217","2619","D","20110217","a6c3.","","","","","R","P")
<.... more records ....>
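So each file has a header row that provides some information about the data set that follows it, such as the provider of the data and the date range it covers.
Now, how can I transform the above structure and create a new relation like the one below: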
{
(HEADER,20110118,20101218,20110118,T00002),{(0000000000000000035412,20110107,2699,D,20110107,2315.,,,,,,C),(0000000000000000035412,20110107,2699,D,20110107,246..,162,74,,,,B),..more tuples..},
(HEADER,20110224,20110109,20110224,T00002),{(0000000000000000035412,20110121,2028,D,20110121,a6c3.,,,,,R,P),(0000000000000000035412,20110217,2619,D,20110217,a6c3.,,,,,R,P),..more tuples..},..more tuples..
}
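where each header tuple is followed by a bag of the record tuples that belong to that header? Unfortunately, there is no common key field between the header and detail rows, so I don't think I can use any JOIN operation.
I am quite new to Pig and Hadoop, and this is one of the first proof-of-concept projects I am working on.
I hope my question is clear, and I look forward to some guidance here.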
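To make the target grouping concrete, here is a minimal sketch in plain Python (not Pig). It assumes the rows are read sequentially in file order, which a distributed Pig relation does not guarantee, so it only illustrates the result I am after. The names `group_by_header` and `marker` are hypothetical and exist only for this illustration.

```python
from typing import Iterable, List, Tuple

Row = Tuple[str, ...]


def group_by_header(rows: Iterable[Row], marker: str = "HEADER") -> List[Tuple[Row, List[Row]]]:
    """Attach each detail row to the most recent header row seen before it.

    Assumption: rows arrive in their original file order.
    """
    groups: List[Tuple[Row, List[Row]]] = []
    for row in rows:
        # Strip the text qualifier (double quotes) from every field.
        clean = tuple(field.strip('"') for field in row)
        if clean and clean[0] == marker:
            # A header row starts a new (header, bag of details) pair.
            groups.append((clean, []))
        elif groups:
            # A detail row belongs to the last header seen.
            groups[-1][1].append(clean)
        else:
            raise ValueError("detail row found before any header row")
    return groups


# Sample rows copied from the dump above.
rows = [
    ('"HEADER"', '"20110118"', '"20101218"', '"20110118"', '"T00002"'),
    ('"0000000000000000035412"', '"20110107"', '"2699"', '"D"', '"20110107"', '"2315."', '""', '""', '""', '""', '""', '"C"'),
    ('"0000000000000000035412"', '"20110107"', '"2699"', '"D"', '"20110107"', '"246.."', '"162"', '"74"', '""', '""', '""', '"B"'),
    ('"HEADER"', '"20110224"', '"20110109"', '"20110224"', '"T00002"'),
    ('"0000000000000000035412"', '"20110121"', '"2028"', '"D"', '"20110121"', '"a6c3."', '""', '""', '""', '""', '"R"', '"P"'),
    ('"0000000000000000035412"', '"20110217"', '"2619"', '"D"', '"20110217"', '"a6c3."', '""', '""', '""', '""', '"R"', '"P"'),
]

groups = group_by_header(rows)
for header, details in groups:
    print(header, len(details))
```

Each element of `groups` pairs one header tuple with the list of detail tuples that followed it, which is the shape I would like the new Pig relation to have.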
Yes, this certainly gets me started, thank you! – rarpal