
I have a DataFrame like the one below. How can I set new list values in a DataFrame column based on a condition in PySpark?

+---+------------------------------------------+
|id |features                                  |
+---+------------------------------------------+
|1  |[6.629056, 0.26771536, 0.79063195, 0.8923]|
|2  |[1.4850719, 0.66458416, -2.1034079]       |
|3  |[3.0975454, 1.571849, 1.9053307]          |
|4  |[2.526619, -0.33559006, -1.4565022]       |
|5  |[-0.9286196, -0.57326394, 4.481531]       |
|6  |[3.594114, 1.3512149, 1.6967168]          |
+---+------------------------------------------+

I want to set some of the features values according to the following conditions, namely where id=1, id=2, and id=6.

Where id=1, the current features value is [6.629056, 0.26771536, 0.79063195, 0.8923], but I want to set it to [0, 0, 0, 0].

Where id=2, the current features value is [1.4850719, 0.66458416, -2.1034079], but I want to set it to [0, 0, 0].

My final output would be:

+---+------------------------------------+
|id |features                            |
+---+------------------------------------+
|1  |[0, 0, 0, 0]                        |
|2  |[0, 0, 0]                           |
|3  |[3.0975454, 1.571849, 1.9053307]    |
|4  |[2.526619, -0.33559006, -1.4565022] |
|5  |[-0.9286196, -0.57326394, 4.481531] |
|6  |[0, 0, 0]                           |
+---+------------------------------------+

Answers

3

Shaido's answer is fine if you have a finite set of ids and you also know the length of the corresponding features.

If not, it is cleaner to use a UDF, with the ids to transform loaded into a separate Seq:

In Scala

import org.apache.spark.sql.functions.udf
import scala.collection.mutable.WrappedArray
import spark.implicits._  // for the $"col" syntax

val arr = Seq(1, 2, 6)

// Replace the feature array with zeros of the same length when the id is in `arr`
val fillArray = udf { (id: Int, array: WrappedArray[Double]) =>
  if (arr.contains(id)) Seq.fill[Double](array.length)(0.0)
  else array
}

df.withColumn("new_features", fillArray($"id", $"features")).show(false)

In Python

from pyspark.sql import functions as f
from pyspark.sql.types import ArrayType, DoubleType

arr = [1, 2, 6]

# Replace the feature list with zeros of the same length when the id is in `arr`
def fillArray(id, features):
    if id in arr:
        return [0.0] * len(features)
    else:
        return features

fill_array_udf = f.udf(fillArray, ArrayType(DoubleType()))

df.withColumn("new_features", fill_array_udf(f.col("id"), f.col("features"))).show()

Output

+---+------------------------------------------+-----------------------------------+
|id |features                                  |new_features                       |
+---+------------------------------------------+-----------------------------------+
|1  |[6.629056, 0.26771536, 0.79063195, 0.8923]|[0.0, 0.0, 0.0, 0.0]               |
|2  |[1.4850719, 0.66458416, -2.1034079]       |[0.0, 0.0, 0.0]                    |
|3  |[3.0975454, 1.571849, 1.9053307]          |[3.0975454, 1.571849, 1.9053307]   |
|4  |[2.526619, -0.33559006, -1.4565022]       |[2.526619, -0.33559006, -1.4565022]|
|5  |[-0.9286196, -0.57326394, 4.481531]       |[-0.9286196, -0.57326394, 4.481531]|
|6  |[3.594114, 1.3512149, 1.6967168]          |[0.0, 0.0, 0.0]                    |
+---+------------------------------------------+-----------------------------------+
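
If you would rather avoid a Python UDF altogether, the same result can be obtained with built-in functions. Below is a minimal sketch, assuming Spark 3.0+ (where array_repeat accepts a column for the repeat count); ids_to_zero is just an illustrative name:

from pyspark.sql import functions as F

# ids whose feature vectors should be zeroed out (illustrative name)
ids_to_zero = [1, 2, 6]

df.withColumn(
    "new_features",
    F.when(F.col("id").isin(ids_to_zero),
           F.array_repeat(F.lit(0.0), F.size("features")))  # zeros of the same length
     .otherwise(F.col("features"))
).show(truncate=False)

This keeps the whole expression on the JVM side, which usually avoids the serialization overhead of a Python UDF.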

I think the OP wants pyspark code – mtoto


Hi Philantrovert, thanks for the quick reply, but I am looking for it in python –


My answer wasn't good; I didn't read the question correctly. Updated now. – philantrovert

1

Use when and otherwise if you have a small set of ids to change:

df.withColumn("features", 
    when(df.id === 1, array(lit(0), lit(0), lit(0), lit(0))) 
    .when(df.id === 2 | df.id === 6, array(lit(0), lit(0), lit(0))) 
    .otherwise(df.features))) 

This should be faster than a UDF, but with many ids it quickly turns into a lot of code. In that case, use a UDF as in philantrovert's answer.
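
As a middle ground, if the ids can still be grouped by their known feature lengths, the when chain stays short even as the id list grows. A minimal sketch under that assumption, using isin to group the ids:

from pyspark.sql.functions import when, array, lit, col

# Assumed grouping: id 1 has 4 features, ids 2 and 6 have 3 features
df.withColumn("features",
    when(col("id") == 1, array(*[lit(0.0)] * 4))
    .when(col("id").isin(2, 6), array(*[lit(0.0)] * 3))
    .otherwise(col("features")))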


This is not really a scalable approach – mtoto


@mtoto: True, for some reason I assumed only 3 different ids needed to be changed. As I wrote in the answer, philantrovert's answer is preferable if there are more than a few. – Shaido
