2016-11-15

Spark: conditional replacement of values

For pandas I have a code snippet like this:

def setUnknownCatValueConditional(df, conditionCol, condition, colToSet, _valueToSet='KEINE'): 
    df.loc[(df[conditionCol] == condition) & (df[colToSet].isnull()), colToSet] = _valueToSet 

which conditionally replaces values in the DataFrame.

Trying to port this functionality to Spark,

df.withColumn("A", when($"A" === "x" and $"B" isNull, "replacement")).show 

did not work for me:

df.withColumn("A", when($"A" === "x" and $"B" isNull, "replacement")).show 
warning: there was one feature warning; re-run with -feature for details 
org.apache.spark.sql.AnalysisException: cannot resolve '((`A` = 'x') AND `B`)' due to data type mismatch: differing types in '((`A` = 'X') AND `B`)' (boolean and string).;; 

even though df.printSchema reports both A and B as strings.

What is wrong here?

Edit

A minimal example:

import java.sql.{ Date, Timestamp } 
case class FooBar(foo:Date, bar:String) 
val myDf = Seq(("2016-01-01","first"),("2016-01-02","second"),("2016-wrongFormat","noValidFormat"), ("2016-01-04","lastAssumingSameDate")) 
     .toDF("foo","bar") 
     .withColumn("foo", 'foo.cast("Date")) 
     .as[FooBar] 

myDf.printSchema 
root 
|-- foo: date (nullable = true) 
|-- bar: string (nullable = true) 


scala> myDf.show 
+----------+--------------------+
|       foo|                 bar|
+----------+--------------------+
|2016-01-01|               first|
|2016-01-02|              second|
|      null|       noValidFormat|
|2016-01-04|lastAssumingSameDate|
+----------+--------------------+

myDf.withColumn("foo", when($"bar" === "noValidFormat" and $"foo" isNull, "noValue")).show 

The expected output, with the conditions chained, is:

+----------+--------------------+
|       foo|                 bar|
+----------+--------------------+
|2016-01-01|               first|
|2016-01-02|              second|
| "noValue"|       noValidFormat|
|2016-01-04|lastAssumingSameDate|
+----------+--------------------+

Edit 2

Chained conditions such as

df
    .withColumn("A",
      when(
        (($"A" === "x") and ($"B" isNull)) or
        (($"A" === "y") and ($"B" isNull)), "replacement"))

should work as well.

+0

Please share sample data and the expected output. – mtoto

+0

@mtoto Please see the edit. –

Answer

2

Mind the operator precedence. It should be:

myDf.withColumn("foo", 
    when(($"bar" === "noValidFormat") and ($"foo" isNull), "noValue")) 

This:

$"bar" === "noValidFormat" and $"foo" isNull 

is evaluated as:

(($"bar" === "noValidFormat") and $"foo") isNull 
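The grouping above follows from Scala's parsing rules: symbolic operators such as `===` bind tighter than alphanumeric ones such as `and`, and a postfix call such as `isNull` binds loosest of all. The sketch below shows this without Spark. `Expr` is a hypothetical stand-in for Spark's `Column`, invented here only to record how an expression was grouped; it is not part of any library.

```scala
import scala.language.postfixOps

// Hypothetical stand-in for Spark's Column: each method only records
// how the expression was grouped by the parser.
final case class Expr(text: String) {
  def ===(other: Any): Expr = Expr(s"($text = $other)")
  def and(other: Expr): Expr = Expr(s"($text AND ${other.text})")
  def isNull: Expr = Expr(s"($text IS NULL)")
}

object PrecedenceDemo {
  val bar: Expr = Expr("bar")
  val foo: Expr = Expr("foo")

  // No inner parentheses: the postfix isNull applies to the whole infix chain.
  val unparenthesized: Expr = (bar === "x" and foo isNull)

  // Explicit parentheses: isNull applies to foo only.
  val parenthesized: Expr = (bar === "x") and (foo isNull)

  def main(args: Array[String]): Unit = {
    println(unparenthesized.text) // (((bar = x) AND foo) IS NULL)
    println(parenthesized.text)   // ((bar = x) AND (foo IS NULL))
  }
}
```

Writing the null check as a method call, `foo.isNull`, gives the intended grouping without extra parentheses and does not rely on postfix operator syntax.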
+0

Strange, this still remains: warning: there were four feature warnings; re-run with -feature for details –

+0

http://www.scala-lang.org/api/current/scala/languageFeature$$postfixOps$.html – 2016-11-15 13:01:01

+0

I will check that. When I chain multiple such statements, like `dfFixedAge.withColumn("C", when(($"A" === "y") and ($"B" isNull), "R")).withColumn("C", when(($"A" === "x") and ($"B" isNull), "R"))`, only the last one persists. –
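A likely explanation for the behaviour in the last comment: `when` without `otherwise` yields null for every row that does not match, so a second `withColumn("C", when(...))` overwrites whatever the first one wrote. One possible way around it is to chain the branches on a single `when` and fall back to the existing column with `otherwise`. The following is a minimal sketch under assumptions: the sample rows, the object name `ChainedWhen` and the helper `replaceConditionally` are made up for illustration, and a local SparkSession is used.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object ChainedWhen {
  // Hypothetical helper: sets C to "R" where A is "x" or "y" and B is null,
  // and keeps the existing value of C everywhere else.
  def replaceConditionally(df: DataFrame): DataFrame =
    df.withColumn("C",
      when((col("A") === "x") and col("B").isNull, "R")
        .when((col("A") === "y") and col("B").isNull, "R")
        .otherwise(col("C")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("chained-when")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data; B is null in the rows that should be replaced.
    val df = Seq(
      ("x", Option.empty[String], "c1"),
      ("y", Option.empty[String], "c2"),
      ("z", Option("b"), "c3")
    ).toDF("A", "B", "C")

    // Rows with A = "x" and A = "y" get C = "R"; the row with A = "z" keeps "c3".
    replaceConditionally(df).show()

    spark.stop()
  }
}
```

The same fallback applies to the `foo` example in the question, with one caveat: `foo` is a date column and "noValue" is a string, so the fallback branch would presumably need `$"foo".cast("string")` for both branches to share a type.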