How to union Spark SQL DataFrames in Python

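Below are several ways of creating a union of DataFrames. Which one, if any, is best or recommended when we are talking about large DataFrames? Should I create an empty DataFrame first, or keep unioning onto the first DataFrame created?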

Empty DataFrame creation:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([ 
    StructField("A", StringType(), False), 
    StructField("B", StringType(), False), 
    StructField("C", StringType(), False) 
]) 

pred_union_df = spark_context.parallelize([]).toDF(schema)

Method 1 - Union as you go:

for ind in indications: 
    fitted_model = get_fitted_model(pipeline, train_balanced_df, ind) 
    pred = get_predictions(fitted_model, pred_output_df, ind) 
    pred_union_df = pred_union_df.union(pred[['A', 'B', 'C']])

Method 2 - Union at the end:

all_pred = [] 
for ind in indications: 
    fitted_model = get_fitted_model(pipeline, train_balanced_df, ind) 
    pred = get_predictions(fitted_model, pred_output_df, ind) 
    all_pred.append(pred) 
# Fails as written: DataFrame.union accepts a single DataFrame, not a list (see the edit below).
pred_union_df = pred_union_df.union(all_pred)

Or do I have this all wrong?

Edit: Method 2 was not possible the way I thought it would be from this answer. I had to loop through the list and union each DataFrame.
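
The loop described in the edit could be sketched roughly as below. This is only a sketch: the helper name union_all is hypothetical and not part of PySpark, and it assumes every DataFrame in the list has the same columns in the same order, since DataFrame.union matches columns by position rather than by name.

```python
from functools import reduce


def union_all(dfs):
    """Fold a non-empty list of DataFrames into one by pairwise union.

    union_all is a hypothetical helper, not a PySpark API. It only assumes
    each element exposes .union(other), as pyspark.sql.DataFrame does.
    """
    dfs = list(dfs)
    if not dfs:
        raise ValueError("union_all needs at least one DataFrame")
    # reduce starts from the first DataFrame, so no empty seed DataFrame is needed.
    return reduce(lambda left, right: left.union(right), dfs)
```

With the names from the question, this would be called as union_all([p.select('A', 'B', 'C') for p in all_pred]). It still chains one union per DataFrame, so the query plan grows with the length of the list.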

2017-08-07 Pouya Yousefi

Method 2 is always preferred, since it avoids the long lineage issue.

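Although DataFrame.union only takes one DataFrame as its argument, SparkContext.union does take a list of RDDs. Given your sample code, you could try to union them at the RDD level before calling toDF.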
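
A rough sketch of that RDD route follows. It assumes spark is an active SparkSession and that every DataFrame in the list matches the same schema; the helper name union_via_rdds is hypothetical. SparkContext.union and SparkSession.createDataFrame are existing PySpark calls.

```python
def union_via_rdds(spark, dfs, schema):
    """Union many DataFrames in one step at the RDD level.

    Hypothetical helper: `spark` is assumed to be an active SparkSession,
    `dfs` a non-empty list, and every DataFrame in it is assumed to match
    `schema`.
    """
    # Drop down to the underlying RDD of Row objects for each DataFrame.
    rdds = [df.rdd for df in dfs]
    # SparkContext.union accepts a list of RDDs and returns a single RDD.
    unioned = spark.sparkContext.union(rdds)
    # Re-apply the shared schema to get a DataFrame back.
    return spark.createDataFrame(unioned, schema)
```

Going through .rdd leaves the DataFrame optimizer and converts rows to Python objects, so whether this beats pairwise DataFrame.union on a given workload is worth measuring rather than assuming.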

If your data is on disk, you could also try to load it all at once to achieve the union, e.g.:

dataframe = spark.read.csv([path1, path2, path3])

2017-08-08 10:25:24 ShuaiYuan
