Split a file into 4 equal parts using Apache Pig

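I want to split a file into 4 equal parts using Apache Pig. For example, if a file has 100 lines, the first 25 should go to the 1st output file, and so on, with the last 25 lines going to the 4th output file. Can someone help me achieve this? I am using Apache Pig because the number of records in the file will be in the millions, and the earlier steps that generate the file to be split already use Pig.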

2015-09-24 user3072054

This works, but there may be a better option.

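-- Load each line as a single chararray field ('file' is the input path).
A = LOAD 'file' USING PigStorage() AS (line:chararray);
-- RANK prepends a 1-based row number column named rank_A.
B = RANK A;
-- Hard-coded boundaries for a 100-line file; >= 1 so the first row is kept.
C = FILTER B BY rank_A >= 1 AND rank_A <= 25;
D = FILTER B BY rank_A > 25 AND rank_A <= 50;
E = FILTER B BY rank_A > 50 AND rank_A <= 75;
F = FILTER B BY rank_A > 75 AND rank_A <= 100;
STORE C INTO 'file1';
STORE D INTO 'file2';
STORE E INTO 'file3';
STORE F INTO 'file4';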

2015-09-24 08:00:45

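Thanks Vignesh. The problem is that I don't know how many records the input file will contain; it could be anywhere from a few thousand to a few million. – user3072054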

You can use some of the Pig features below to get the result you want.

The SPLIT operator: http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#SPLIT
The MultiStorage class: https://pig.apache.org/docs/r0.10.0/api/org/apache/pig/piggybank/storage/MultiStorage.html
Writing a custom Pig store function: https://pig.apache.org/docs/r0.7.0/udf.html#Store+Functions

You have to supply some condition based on the data.
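As a rough illustration of such a condition when the row count is not known in advance, SPLIT can be combined with RANK and a row count taken from GROUP ... ALL. This is only a sketch: it assumes a Pig version that has RANK (0.11 or later), and the relation names and the 'input_file' and 'part1' to 'part4' paths are placeholders that do not come from the posts above.

```pig
-- Sketch only: 'input_file' and 'part1'..'part4' are placeholder paths.
lines   = LOAD 'input_file' USING PigStorage() AS (line:chararray);

-- RANK without BY prepends a 1-based row number named rank_lines.
ranked  = RANK lines;

-- Count all rows; 'total' is a single-row relation used as a scalar below.
grouped = GROUP ranked ALL;
total   = FOREACH grouped GENERATE COUNT(ranked) AS n;

-- Route each row to a quarter based on its rank (integer division).
SPLIT ranked INTO
    q1 IF rank_lines <= total.n / 4,
    q2 IF rank_lines >  total.n / 4 AND rank_lines <= total.n / 2,
    q3 IF rank_lines >  total.n / 2 AND rank_lines <= (total.n * 3) / 4,
    q4 OTHERWISE;

-- Drop the rank column again before storing.
out1 = FOREACH q1 GENERATE line;
out2 = FOREACH q2 GENERATE line;
out3 = FOREACH q3 GENERATE line;
out4 = FOREACH q4 GENERATE line;

STORE out1 INTO 'part1';
STORE out2 INTO 'part2';
STORE out3 INTO 'part3';
STORE out4 INTO 'part4';
```

When the row count is not divisible by 4, the quarters differ by a row or two because of the integer division. Each output path is a directory that may itself contain more than one part file.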

2015-09-24 15:34:20 pradeep

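My requirement changed a bit: I only need to store the first 25% of the data in one file and the rest in another. This is the Pig script that worked for me.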

ip_file = LOAD 'input file' USING PigStorage('|');
-- Rank the rows by the third field; the rank becomes the first column.
rank_file = RANK ip_file BY $2;
rank_group = GROUP rank_file ALL;
-- $0 is the total row count, $1 is the rank of each row.
with_max = FOREACH rank_group GENERATE COUNT(rank_file), FLATTEN(rank_file);
top_file = FILTER with_max BY $1 <= $0/4;
rest_file = FILTER with_max BY $1 > $0/4;
sort_top_file = ORDER top_file BY $1 PARALLEL 1;
STORE sort_top_file INTO 'output file 1' USING PigStorage('|');
STORE rest_file INTO 'output file 2' USING PigStorage('|');

2015-09-24 20:03:00 user3072054

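I did a bit of digging on this, because it came up in a sample exam for Hadoop. It does not seem to be well documented, but it is really quite simple. In this example I am using the Country sample database available for download from dev.mysql.com: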

grunt> storeme = order data by $0 parallel 3; 
grunt> store storeme into '/user/hive/countrysplit_parallel';

Then, if we look at the directory in HDFS:

[[email protected] arthurs_stuff]# hadoop fs -ls /user/hive/countrysplit_parallel 
Found 4 items 
-rw-r--r-- 3 hive hdfs   0 2016-04-08 10:19 /user/hive/countrysplit_parallel/_SUCCESS 
-rw-r--r-- 3 hive hdfs  3984 2016-04-08 10:19 /user/hive/countrysplit_parallel/part-r-00000 
-rw-r--r-- 3 hive hdfs  4614 2016-04-08 10:19 /user/hive/countrysplit_parallel/part-r-00001 
-rw-r--r-- 3 hive hdfs  4768 2016-04-08 10:19 /user/hive/countrysplit_parallel/part-r-00002
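For the four-way case in the question, the same idea could be sketched with PARALLEL 4, which asks for four reducers on the ORDER step and so should yield four part files. The paths, the ',' delimiter, and the sort column below are assumptions made for illustration, not details from the original post.

```pig
-- Sketch only: placeholder input/output paths and delimiter.
data    = LOAD '/path/to/input' USING PigStorage(',');

-- Sorting with four reducers produces one part-r-0000N file per reducer.
storeme = ORDER data BY $0 PARALLEL 4;

STORE storeme INTO '/path/to/output_parallel4';
```

The parts produced this way are only approximately equal in size, as the differing file sizes in the listing above suggest, so this suits cases where a rough split is acceptable.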

Hope that helps.

2016-04-08 10:30:15

Great solution! –

Somehow very helpful! – ChikuMiku
