在Apache Spark中分割數據幀

使用帶有pyspark的Apache Spark 2.0，我有一個包含1000行數據的DataFrame，並且希望將該DataFrame分割/分割成2個獨立的DataFrame;在Apache Spark中分割數據幀

第一個數據幀應包含第750行
第二個數據幀應該包含剩餘的250行

注：隨機種子是不夠的，因爲我打算重複這種分裂方法多次，並希望控制哪些數據用於第一個和第二個DataFrame。

我發現take（n）方法對生成第一個結果很有用。
但我似乎無法找到正確的方式（或任何方式）獲取第二個DataFrame。

任何指針在正確的方向將不勝感激。

在此先感謝。

更新：我現在已經設法通過排序和再次應用take（n）來找到解決方案。這仍然感覺雖然次優解：

# First DataFrame, simply take the first 750 rows 
part1 = spark.createDataFrame(df.take(750)) 
# Second DataFrame, sort by key descending, then take 250 rows 
part2 = spark.createDataFrame(df.rdd.sortByKey(False).toDF().take(250)) 
# Then reverse the order again, to maintain the original order 
part2 = part2.rdd.sortByKey(True).toDF() 
# Then rename the columns as they have been reset to "_1" and "_2" by the sorting process 
part2 = part2.withColumnRenamed("_1", "label").withColumnRenamed("_2", "features")

來源

2016-08-15 kvdv

你是對使用取，因爲它繪製數據的驅動程序，然後重新分配createDataFrame它在集羣質疑。如果您的驅動程序沒有足夠的內存來存儲數據，則效率低下並可能失敗。

下面是創建一個行索引列和片上解決方案：

from pyspark.sql.functions import monotonicallyIncreasingId 

idxDf = df.withColumn("idx", monotonicallyIncreasingId()) 
part1 = idxDf.filter('idx < 750') 
part2 = idxDf.filter('idx >= 750')

來源

2016-08-15 23:18:11

在Apache Spark中分割數據幀

回答

相關問題