How to generate a large data set using hive/spark-sql?

Create a partitioned seed table

create table seed (i int)
partitioned by (p int);

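Populate the seed table with 1K records holding consecutive numbers between 0 and 999.
Each record is inserted into a different partition, so it is located in a different HDFS directory and, more importantly, in a different file.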

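P.s.

The following settings are needed. The first two allow a fully dynamic insert into 1,000 partitions (the per-node limit defaults to 100); the last two keep Hive from combining the small seed files into a single input split.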

set hive.exec.dynamic.partition.mode=nonstrict; 
set hive.exec.max.dynamic.partitions.pernode=1000; 
set hive.hadoop.supports.splittable.combineinputformat=false; 
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

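-- space(999) is a string of 999 spaces; splitting it on ' ' gives 1,000 elements,
-- so posexplode emits the positions 0..999, used here as both value and partition key
insert into table seed partition (p)
select i,i
from (select 1) x lateral view posexplode (split (space (999),' ')) e as i,x;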
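
As a sanity check on the step above, here is a minimal sketch that first shows the space/split/posexplode idiom on a tiny input and then inspects the populated seed table. The aliases pos, val, row_count, min_i, max_i and partition_count are illustrative names and are not part of the original answer.

```sql
-- Same idiom as the insert above, with 3 spaces instead of 999:
-- this should return four rows, with positions 0 through 3.
select e.pos, e.val
from (select 1) x lateral view posexplode (split (space (3),' ')) e as pos, val;

-- If the insert succeeded, the seed table should hold 1,000 rows
-- with i ranging from 0 to 999, spread over 1,000 partitions.
select count(*)          as row_count
      ,min(i)            as min_i
      ,max(i)            as max_i
      ,count(distinct p) as partition_count
from seed;

-- Lists one partition (and hence one HDFS directory) per seed record.
show partitions seed;
```

If partition_count comes back lower than 1,000, the dynamic-partition settings are the first thing to check.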

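Generate a table with 1G records.
Each of the 1K records in the seed table is in a different file and is read by a different container.
Each container generates 1M records.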

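-- each seed value s.i (0..999) is expanded into 1,000,000 positions e.i (0..999999),
-- so n covers a distinct block of 1M consecutive numbers per seed record
create table t1g
as
select s.i*1000000 + e.i + 1 as n
from seed s lateral view posexplode (split (space (1000000-1),' ')) e as i,x;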
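
The numbering can be checked with a single aggregate, sketched below. With s.i between 0 and 999 and e.i between 0 and 999999, n = s.i*1000000 + e.i + 1 should run from 1 to 1,000,000,000, which still fits in a Hive int. The aliases row_count, min_n and max_n are illustrative names only.

```sql
-- Expected, if every container produced its full 1M-row block:
-- row_count = 1000000000, min_n = 1, max_n = 1000000000
select count(*) as row_count
      ,min(n)   as min_n
      ,max(n)   as max_n
from t1g;
```

This scans the whole table, so it is a one-off validation rather than something to run routinely.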

Source

2017-03-05 13:39:28

Smart approach

@PraveenKumarKrishnaiyer - Thanks :-)
