Avoid performance impact of a single partition mode in Spark window functions

My question is triggered by the use case of calculating the differences between consecutive rows in a Spark DataFrame.

For example, I have:

>>> df.show()
+-----+----------+
|index|      col1|
+-----+----------+
|  0.0|0.58734024|
|  1.0|0.67304325|
|  2.0|0.85154736|
|  3.0| 0.5449719|
+-----+----------+

If I choose to calculate these using the "Window" function, then I can do it like so:

>>> from pyspark.sql import Window
>>> import pyspark.sql.functions as f
>>> winSpec = Window.partitionBy(df.index >= 0).orderBy(df.index.asc())
>>> df.withColumn('diffs_col1', f.lag(df.col1, -1).over(winSpec) - df.col1).show()
+-----+----------+-----------+
|index|      col1| diffs_col1|
+-----+----------+-----------+
|  0.0|0.58734024|0.085703015|
|  1.0|0.67304325| 0.17850411|
|  2.0|0.85154736|-0.30657548|
|  3.0| 0.5449719|       null|
+-----+----------+-----------+

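Question: I explicitly partitioned the DataFrame into a single partition. What is the performance impact of this, if any, why is that so, and how can I avoid it? I ask because when I do not specify a partition, I get the following warning: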

16/12/24 13:52:27 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

2016-12-24 Ytsen de Boer

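In practice, the performance impact will be almost the same as if you had omitted the partitionBy clause altogether. All records will be shuffled to a single partition, sorted locally, and iterated over sequentially, one by one.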

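The only difference is the total number of partitions created. Let's illustrate that with an example, using a simple dataset with 10 partitions and 1000 records: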

from pyspark.sql import Window, functions as f  # assumes an active SparkSession named `spark`
df = spark.range(0, 1000, 1, 10).toDF("index").withColumn("col1", f.randn(42))

If you define the frame without a partition by clause

w_unpart = Window.orderBy(f.col("index").asc())

and use it with lag

df_lag_unpart = df.withColumn(
    "diffs_col1", f.lag("col1", 1).over(w_unpart) - f.col("col1") 
)

there will be only one partition in total:

df_lag_unpart.rdd.glom().map(len).collect()

[1000]

Compared to that, a frame definition with a dummy index (simplified a bit compared to your code):

w_part = Window.partitionBy(f.lit(0)).orderBy(f.col("index").asc())

will use a number of partitions equal to spark.sql.shuffle.partitions:

spark.conf.set("spark.sql.shuffle.partitions", 11) 

df_lag_part = df.withColumn(
    "diffs_col1", f.lag("col1", 1).over(w_part) - f.col("col1") 
) 

df_lag_part.rdd.glom().count()

with only one non-empty partition:

df_lag_part.rdd.glom().filter(lambda x: x).count()
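
To see why a constant partitioning key leaves all but one partition empty, here is a minimal plain-Python sketch. It does not use Spark: Python's built-in `hash` and the modulo below only stand in for Spark's hash partitioning (an assumption made for illustration), and `assign_partition` is a hypothetical helper, not part of any Spark API.

```python
from collections import Counter

NUM_PARTITIONS = 11  # mirrors spark.sql.shuffle.partitions in the example above


def assign_partition(key, num_partitions=NUM_PARTITIONS):
    # Stand-in for hash partitioning: equal keys always land in the same partition.
    return hash(key) % num_partitions


# Every row carries the same dummy key, as with partitionBy(f.lit(0)).
constant_key_sizes = Counter(assign_partition(0) for _ in range(1000))

non_empty = len(constant_key_sizes)           # number of partitions that received rows
largest = max(constant_key_sizes.values())    # size of the biggest partition
```

Whatever hash function is used, a single distinct key can only ever map to a single partition, so the remaining partitions exist but stay empty.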

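Unfortunately, there is no universal solution that can be used to address this problem in PySpark. It is simply an inherent mechanism of the implementation, combined with the distributed processing model.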

Since the index column is sequential, you could generate an artificial partitioning key with a fixed number of records per block:

rec_per_block = df.count() // int(spark.conf.get("spark.sql.shuffle.partitions")) 

df_with_block = df.withColumn(
    "block", (f.col("index")/rec_per_block).cast("int") 
)

and use it to define the frame specification:

w_with_block = Window.partitionBy("block").orderBy("index") 

df_lag_with_block = df_with_block.withColumn(
    "diffs_col1", f.lag("col1", 1).over(w_with_block) - f.col("col1") 
)

This will use the expected number of partitions:

df_lag_with_block.rdd.glom().count()

with a roughly uniform data distribution (we cannot avoid hash collisions):

df_lag_with_block.rdd.glom().map(len).collect()

[0, 180, 0, 90, 90, 0, 90, 90, 100, 90, 270]

but with a number of gaps (nulls) on the block boundaries:

df_lag_with_block.where(f.col("diffs_col1").isNull()).count()
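
To see where these gaps come from, the block arithmetic can be replayed in plain Python without Spark. This is only a sketch of the same integer arithmetic, assuming the 1000 sequential indexes and the 11 shuffle partitions used above; the variable names are illustrative.

```python
from collections import Counter

n_records = 1000          # df.count() in the example above
shuffle_partitions = 11   # spark.sql.shuffle.partitions in the example above

rec_per_block = n_records // shuffle_partitions

# Same arithmetic as (index / rec_per_block).cast("int") for non-negative indexes.
block_sizes = Counter(index // rec_per_block for index in range(n_records))

n_blocks = len(block_sizes)

# lag(..., 1) has no previous row for the first record of each block.
first_index_per_block = [block * rec_per_block for block in sorted(block_sizes)]
```

Because the integer division leaves a remainder, there is one more block than there are shuffle partitions, so at least two blocks must share a partition. The first row of every block has no predecessor inside that block, which is why one would expect one null per block.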

Since the boundaries are easy to compute:

from itertools import chain 

boundary_idxs = sorted(chain.from_iterable(
    # Here we depend on sequential identifiers 
    # This could be generalized to any monotonically increasing 
    # id by taking min and max per block 
    (idx - 1, idx) for idx in 
    df_lag_with_block.groupBy("block").min("index") 
     .drop("block").rdd.flatMap(lambda x: x) 
     .collect()))[2:] # The first boundary doesn't carry useful inf.

you can always select:

missing = df_with_block.where(f.col("index").isin(boundary_idxs))

and fill these separately:

# We use window without partitions here. Since number of records 
# will be small this won't be a performance issue 
# but will generate "Moving all data to a single partition" warning 
missing_with_lag = missing.withColumn(
    "diffs_col1", f.lag("col1", 1).over(w_unpart) - f.col("col1") 
).select("index", f.col("diffs_col1").alias("diffs_fill"))

and join:

combined = (df_lag_with_block 
    .join(missing_with_lag, ["index"], "leftouter") 
    .withColumn("diffs_col1", f.coalesce("diffs_col1", "diffs_fill")))

to get the desired result:

mismatched = combined.join(df_lag_unpart, ["index"], "outer").where(
    combined["diffs_col1"] != df_lag_unpart["diffs_col1"] 
) 
assert mismatched.count() == 0
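
The whole approach can also be checked without a cluster. The following plain-Python sketch mimics the steps above on ids that are increasing but not sequential, taking the min and max id per block as the code comment in the boundary computation suggests. The function names `lag_diff` and `blockwise_lag_diff`, the block width, and the random test data are all assumptions made for illustration, not part of the original answer.

```python
import random
from itertools import groupby


def lag_diff(rows):
    """Return {id: previous_value - value} for rows sorted by id (None for the first row)."""
    diffs, prev = {}, None
    for row_id, value in rows:
        diffs[row_id] = None if prev is None else prev - value
        prev = value
    return diffs


def blockwise_lag_diff(rows, block_width):
    """Same result as lag_diff, computed per block and patched at block boundaries."""
    values = dict(rows)
    diffs = {}
    edges = []  # (min id, max id) of every non-empty block, in id order
    for _, block in groupby(rows, key=lambda r: r[0] // block_width):
        block = list(block)
        diffs.update(lag_diff(block))  # independent pass over one block
        edges.append((block[0][0], block[-1][0]))
    # Fill the first row of every block except the very first one.
    for (_, prev_max), (cur_min, _) in zip(edges, edges[1:]):
        diffs[cur_min] = values[prev_max] - values[cur_min]
    return diffs


random.seed(42)
ids = sorted(random.sample(range(10_000), 1000))  # increasing ids with gaps
rows = [(i, random.gauss(0, 1)) for i in ids]

expected = lag_diff(rows)                          # single-partition reference
patched = blockwise_lag_diff(rows, block_width=900)
```

Only the first row of each block needs the previous block's last value, so the amount of data that has to be handled outside the per-block pass grows with the number of blocks, not with the number of records.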

2016-12-24 19:29:29 user6910411
