Spark：保存由「虛擬」列分區的DataFrame

我使用PySpark做經典的ETL作業（加載數據集，處理它，保存它）並且想要將我的Dataframe保存爲由「虛擬」列分區的文件/目錄;我的意思是「虛擬」是我有一個列時間戳，這是一個包含ISO 8601編碼日期的字符串，我想按年/月/日進行分區;但我實際上並沒有DataFrame中的Year，Month或Day列;我有這個時間戳，我可以派生這些列，但我不希望我的結果項目有這些列中的一個序列化。Spark：保存由「虛擬」列分區的DataFrame

從數據幀保存到磁盤，應該像產生的文件結構：

/ 
    year=2016/ 
     month=01/ 
      day=01/ 
       part-****.gz

有沒有辦法做我想做的星火/ Pyspark？

來源

2016-02-16 arnaud briche

用於分區的列不包含在序列化數據本身中。例如，如果你創建DataFrame這樣的：

df = sc.parallelize([ 
    (1, "foo", 2.0, "2016-02-16"), 
    (2, "bar", 3.0, "2016-02-16") 
]).toDF(["id", "x", "y", "date"])

，並寫出如下：

import tempfile 
from pyspark.sql.functions import col, dayofmonth, month, year 
outdir = tempfile.mktemp() 

dt = col("date").cast("date") 
fname = [(year, "year"), (month, "month"), (dayofmonth, "day")] 
exprs = [col("*")] + [f(dt).alias(name) for f, name in fname] 

(df 
    .select(*exprs) 
    .write 
    .partitionBy(*(name for _, name in fname)) 
    .format("json") 
    .save(outdir))

單個文件不包含分區列：

import os 

(sqlContext.read 
    .json(os.path.join(outdir, "year=2016/month=2/day=16/")) 
    .printSchema()) 

## root 
## |-- date: string (nullable = true) 
## |-- id: long (nullable = true) 
## |-- x: string (nullable = true) 
## |-- y: double (nullable = true)

分區數據只存儲在目錄結構中，而不是在序列化文件中重複。它只會在您讀取完整或部分目錄樹時附加：

sqlContext.read.json(outdir).printSchema() 

## root 
## |-- date: string (nullable = true) 
## |-- id: long (nullable = true) 
## |-- x: string (nullable = true) 
## |-- y: double (nullable = true) 
## |-- year: integer (nullable = true) 
## |-- month: integer (nullable = true) 
## |-- day: integer (nullable = true) 

sqlContext.read.json(os.path.join(outdir, "year=2016/month=2/")).printSchema() 

## root 
## |-- date: string (nullable = true) 
## |-- id: long (nullable = true) 
## |-- x: string (nullable = true) 
## |-- y: double (nullable = true) 
## |-- day: integer (nullable = true)

來源

2016-02-17 06:37:15 zero323

我是新來的python。有沒有辦法做到這一點，而沒有年= =，月=，和日=在路徑中？我明白這一點 – deanw

Spark：保存由「虛擬」列分區的DataFrame

回答

相關問題