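You should be able to create an external table that points directly at the data in Cloud Storage. This should work with both Hive and Spark SQL. In many cases, this is probably the best strategy.
Here is an example based on a public dataset in Cloud Storage.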
CREATE EXTERNAL TABLE natality_csv (
source_year BIGINT, year BIGINT, month BIGINT, day BIGINT, wday BIGINT,
state STRING, is_male BOOLEAN, child_race BIGINT, weight_pounds FLOAT,
plurality BIGINT, apgar_1min BIGINT, apgar_5min BIGINT,
mother_residence_state STRING, mother_race BIGINT, mother_age BIGINT,
gestation_weeks BIGINT, lmp STRING, mother_married BOOLEAN,
mother_birth_state STRING, cigarette_use BOOLEAN, cigarettes_per_day BIGINT,
alcohol_use BOOLEAN, drinks_per_week BIGINT, weight_gain_pounds BIGINT,
born_alive_alive BIGINT, born_alive_dead BIGINT, born_dead BIGINT,
ever_born BIGINT, father_race BIGINT, father_age BIGINT,
record_weight BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'gs://public-datasets/natality/csv'
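Once the table is defined, it can be queried like any other table; the data stays in Cloud Storage and is read in place at query time. As a minimal sketch (this query is an illustration, not part of the original example, and `row_count` is just an alias chosen here), a quick sanity check against the `natality_csv` table might look like this:

```sql
-- Illustrative check: count the rows per year to confirm that the
-- external table can read the CSV files directly from the gs:// location.
SELECT year, COUNT(*) AS row_count
FROM natality_csv
GROUP BY year
ORDER BY year
```

The same statement should be accepted by both Hive and Spark SQL, assuming the cluster has a Cloud Storage connector configured so that `gs://` paths resolve.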
Admittedly, based on the comments on your question, I'm not sure whether I'm missing another part of your question.
I now realize that I can specify a location directly in Cloud Storage with a LOCATION 'gs://...' clause. Even so, the first part of my question still stands. – femibyte