Reading an array-type column from a Postgres DB into a Spark DataFrame

I have a local PSQL database on my machine. Some of its columns contain data in arrays (example below).

+--------------------+ 
|   _authors| 
+--------------------+ 
|[u'Miller, Roger ...| 
|[u'Noyes, H.Pierre']| 
|[u'Berman, S.M.',...| 
+--------------------+ 
only showing top 3 rows 

root 
|-- _authors: string (nullable = true)

I need to read them as an array / WrappedArray instead of a string. How can I achieve this?

val sqlContext: SQLContext = new SQLContext(sc) 
val df_records = sqlContext.read.format("jdbc").option("url", "jdbc:postgresql://localhost:5432/dbname") 
    .option("driver", "org.postgresql.Driver") 
    .option("dbtable", "public.records") 
    .option("user", "name") 
    .option("password", "pwd").load().select("_authors") 
df_records.printSchema()

I need to explode/flatten this array at a later stage of my pipeline.

Thanks,

2016-05-05 Krishna Kalyan

Have you tried adding `.schema(s: StructType)` to the reader? You have to pass the complete schema as a StructType object.

@DanieldePaula I couldn't find any examples. Could you elaborate? Thanks

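I have two suggestions for your problem:

1) I don't know whether it works for arrays, but it is worth a try: you can define a specific schema when reading a DataFrame from a source. For example: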

import org.apache.spark.sql.types._

val customSchema = StructType(Seq(
    StructField("_authors", DataTypes.createArrayType(StringType), true),
    StructField("int_column", IntegerType, true)
    // other columns...
))

val df_records = sqlContext.read 
    .format("jdbc") 
    .option("url", "jdbc:postgresql://localhost:5432/dbname") 
    .option("driver", "org.postgresql.Driver") 
    .option("dbtable", "public.records") 
    .option("user", "name") 
    .option("password", "pwd") 
    .schema(customSchema) 
    .load() 

df_records.select("_authors").show()

2) If the first option doesn't work, the only thing I can think of at the moment is defining a parsing UDF:

val splitString: (String => Seq[String]) = { s => 
    val seq = s.split(",").map(i => i.trim).toSeq 

    // Remove "[u" from the first element and "]" from the last.
    // Note: names that contain a comma are split as well, and quotes are kept.
    Seq(seq(0).drop(2)) ++ 
    seq.drop(1).take(seq.length-2) ++ 
     Seq(seq.last.take(seq.last.length-1)) 
} 

import org.apache.spark.sql.functions._ 
val newDF = df_records 
    .withColumn("authors_array", udf(splitString).apply(col("_authors")))
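
One thing to watch with the split-based function above: author names such as 'Miller, Roger' contain commas themselves, so splitting on "," cuts a single name into pieces. Below is a minimal sketch of an alternative that pulls out each quoted entry with a regular expression. It assumes the column holds Python-style list text like `[u'A, B', u'C']` with single-quoted entries and no escaped quotes inside them; `AuthorParser` and `parseAuthors` are hypothetical names, not part of Spark.

```scala
// Hypothetical helper, not part of Spark.
// Assumption: every entry looks like u'...' (or '...') and contains no quote characters.
object AuthorParser {
  // Optional "u" prefix, then everything between a pair of single quotes.
  private val quoted = """u?'([^']*)'""".r

  def parseAuthors(s: String): Seq[String] =
    if (s == null) Seq.empty[String]                        // null column value -> empty list
    else quoted.findAllMatchIn(s).map(_.group(1)).toList    // keep only the text inside the quotes
}
```

It should be usable in place of `splitString`, e.g. `udf(AuthorParser.parseAuthors _).apply(col("_authors"))`. Once the column is a real array, `explode(col("authors_array"))` from `org.apache.spark.sql.functions` gives one row per author, which covers the flattening step mentioned in the question.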

More details on StructType: org.apache.spark.sql.types.StructType
More examples of defining UDFs: this tutorial

2016-05-06 11:44:35
