
I have a DataFrame like the one below. How can I set new list values in a DataFrame column based on a condition in PySpark?

+---+------------------------------------------+
|id |features                                  |
+---+------------------------------------------+
|1  |[6.629056, 0.26771536, 0.79063195, 0.8923]|
|2  |[1.4850719, 0.66458416, -2.1034079]       |
|3  |[3.0975454, 1.571849, 1.9053307]          |
|4  |[2.526619, -0.33559006, -1.4565022]       |
|5  |[-0.9286196, -0.57326394, 4.481531]       |
|6  |[3.594114, 1.3512149, 1.6967168]          |
+---+------------------------------------------+

I want to set some of the features values according to the following conditions, namely where id=1, id=2, and id=6.

Where id=1, the current features value is [6.629056, 0.26771536, 0.79063195, 0.8923], but I want to set it to [0, 0, 0, 0].

Where id=2, the current features value is [1.4850719, 0.66458416, -2.1034079], but I want to set it to [0, 0, 0].

My final output would be:

+---+------------------------------------+
|id |features                            |
+---+------------------------------------+
|1  |[0, 0, 0, 0]                        |
|2  |[0, 0, 0]                           |
|3  |[3.0975454, 1.571849, 1.9053307]    |
|4  |[2.526619, -0.33559006, -1.4565022] |
|5  |[-0.9286196, -0.57326394, 4.481531] |
|6  |[0, 0, 0]                           |
+---+------------------------------------+

Answers

3

Shaido's answer is fine if you have a finite set of ids and you also know the length of the corresponding features.

If not, it is cleaner to use a UDF, with the ids to transform loaded into a separate Seq:

In Scala

import org.apache.spark.sql.functions.udf
import scala.collection.mutable.WrappedArray
import spark.implicits._  // for the $"col" syntax

val arr = Seq(1, 2, 6)

// Replace the feature array with zeros of the same length when the id is in `arr`
val fillArray = udf { (id: Int, array: WrappedArray[Double]) =>
  if (arr.contains(id)) Seq.fill[Double](array.length)(0.0)
  else array
}

df.withColumn("new_features", fillArray($"id", $"features")).show(false)

In Python

from pyspark.sql import functions as f
from pyspark.sql.types import ArrayType, DoubleType

arr = [1, 2, 6]

# Replace the feature list with zeros of the same length when the id is in `arr`
def fillArray(id, features):
    if id in arr:
        return [0.0] * len(features)
    else:
        return features

fill_array_udf = f.udf(fillArray, ArrayType(DoubleType()))

df.withColumn("new_features", fill_array_udf(f.col("id"), f.col("features"))).show()

Output

+---+------------------------------------------+-----------------------------------+
|id |features                                  |new_features                       |
+---+------------------------------------------+-----------------------------------+
|1  |[6.629056, 0.26771536, 0.79063195, 0.8923]|[0.0, 0.0, 0.0, 0.0]               |
|2  |[1.4850719, 0.66458416, -2.1034079]       |[0.0, 0.0, 0.0]                    |
|3  |[3.0975454, 1.571849, 1.9053307]          |[3.0975454, 1.571849, 1.9053307]   |
|4  |[2.526619, -0.33559006, -1.4565022]       |[2.526619, -0.33559006, -1.4565022]|
|5  |[-0.9286196, -0.57326394, 4.481531]       |[-0.9286196, -0.57326394, 4.481531]|
|6  |[3.594114, 1.3512149, 1.6967168]          |[0.0, 0.0, 0.0]                    |
+---+------------------------------------------+-----------------------------------+
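
If you would rather avoid a Python UDF altogether, the same result can be obtained with built-in functions. Below is a minimal sketch, assuming Spark 3.0+ (where array_repeat accepts a column for the repeat count); ids_to_zero is just an illustrative name:

from pyspark.sql import functions as F

# ids whose feature vectors should be zeroed out (illustrative name)
ids_to_zero = [1, 2, 6]

df.withColumn(
    "new_features",
    F.when(F.col("id").isin(ids_to_zero),
           F.array_repeat(F.lit(0.0), F.size("features")))  # zeros of the same length
     .otherwise(F.col("features"))
).show(truncate=False)

This keeps the whole expression on the JVM side, which usually avoids the serialization overhead of a Python UDF.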

I think the OP wants pyspark code – mtoto


Hi Philantrovert, thanks for the quick reply, but I am looking for it in python –


My answer wasn't good; I didn't read the question correctly. Updated now. – philantrovert

1

Use when and otherwise if you have a small set of ids to change:

df.withColumn("features", 
    when(df.id === 1, array(lit(0), lit(0), lit(0), lit(0))) 
    .when(df.id === 2 | df.id === 6, array(lit(0), lit(0), lit(0))) 
    .otherwise(df.features))) 

This should be faster than a UDF, but with many ids it quickly turns into a lot of code. In that case, use a UDF as in philantrovert's answer.
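
As a middle ground, if the ids can still be grouped by their known feature lengths, the when chain stays short even as the id list grows. A minimal sketch under that assumption, using isin to group the ids:

from pyspark.sql.functions import when, array, lit, col

# Assumed grouping: id 1 has 4 features, ids 2 and 6 have 3 features
df.withColumn("features",
    when(col("id") == 1, array(*[lit(0.0)] * 4))
    .when(col("id").isin(2, 6), array(*[lit(0.0)] * 3))
    .otherwise(col("features")))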


This is not really a scalable approach – mtoto


@mtoto: True, for some reason I assumed only 3 different ids needed to be changed. As I wrote in the answer, philantrovert's answer is preferable if there are more than a few. – Shaido
