If a string contains a digit, replace the whole string with null - Spark

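I am new to Spark with Scala. I am trying to clean some data and have run into a problem with the FIRSTNAME and LASTNAME columns: some of the strings contain digits. How do I detect the digits and replace the whole string with null?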

Consider the following dataframe: 

+---------+--------+
|FIRSTNAME|LASTNAME|
+---------+--------+
|    Steve|    10 C|
|     Mark|    9436|
|    Brian|    Lara|
+---------+--------+

How do I get this: 

+---------+--------+
|FIRSTNAME|LASTNAME|
+---------+--------+
|    Steve|    null|
|     Mark|    null|
|    Brian|    Lara|
+---------+--------+

Any help would be appreciated. Thank you very much!

Edit:

scala> df2.withColumn("LASTNAME_TEMP", when(col("LASTNAME").contains("1"), null).otherwise(col("LASTNAME"))).show() 
+---------+--------+-------------+
|FIRSTNAME|LASTNAME|LASTNAME_TEMP|
+---------+--------+-------------+
|    Steve|    10 C|         null|
|     Mark|    9436|         9436|
|    Brian|    Lara|         Lara|
+---------+--------+-------------+

But the code above only checks for a single string. I would rather have it take a list of strings. For example:

val numList = List("1", "2", "3", "4", "5", "6", "7", "8", "9", "0")

I declared the list above and ran the following code:

scala> df2.filter(col("LASTNAME").isin(numList:_*)).show()

I got the following dataframe:

+---------+--------+ 
|FIRSTNAME|LASTNAME| 
+---------+--------+ 
+---------+--------+


2017-06-12 ankursg8

What have you tried so far? What problem did you run into when you executed the code you wrote? – Dima

You can use a regular expression with rlike for pattern matching:

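import org.apache.spark.sql.functions.{col, when}
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark, as in spark-shell

val df = Seq(
    ("Steve", "10 C"),
    ("Mark", "9436"),
    ("Brian", "Lara")
).toDF(
    "FIRSTNAME", "LASTNAME"
)

// Keep the original LASTNAME in the new column only if it does not contain any digit;
// when() without otherwise() yields null for the remaining rows
val df2 = df.withColumn("LASTNAMEFIXED", when(! col("LASTNAME").rlike(".*[0-9]+.*"), col("LASTNAME")))

df2.show()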

+---------+--------+-------------+
|FIRSTNAME|LASTNAME|LASTNAMEFIXED|
+---------+--------+-------------+
|    Steve|    10 C|         null|
|     Mark|    9436|         null|
|    Brian|    Lara|         Lara|
+---------+--------+-------------+
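
The question also asked about driving the check from a list such as numList. The isin filter there came back empty because isin compares the whole cell value for equality with each list entry, so "10 C" and "9436" never equal "1", "2", and so on. One possible way to keep the list, sketched below, is to join its entries into a single alternation pattern and pass that to rlike. The object name NullifyByList and the helpers patternFor and nullify are made up for this illustration and are not part of Spark; the sketch also assumes the list is non-empty and that overwriting LASTNAME in place is acceptable.

```scala
import java.util.regex.Pattern
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

// Hypothetical helper object, for illustration only.
object NullifyByList {
  // Join the tokens into one alternation pattern, quoting each entry so that
  // regex metacharacters in a token are treated literally.
  // Assumes tokens is non-empty (an empty pattern would match every string).
  def patternFor(tokens: Seq[String]): String =
    tokens.map(t => Pattern.quote(t)).mkString("|")

  // Overwrite `column` with null wherever its value contains any of the tokens;
  // when() without otherwise() yields null for the rows that do not pass.
  def nullify(df: DataFrame, column: String, tokens: Seq[String]): DataFrame =
    df.withColumn(column, when(!col(column).rlike(patternFor(tokens)), col(column)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("nullify-by-list")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("Steve", "10 C"), ("Mark", "9436"), ("Brian", "Lara"))
      .toDF("FIRSTNAME", "LASTNAME")
    val numList = List("1", "2", "3", "4", "5", "6", "7", "8", "9", "0")

    nullify(df, "LASTNAME", numList).show()
    spark.stop()
  }
}
```

For single digits this is equivalent to the "[0-9]" character class; the list form only pays off if the entries are arbitrary substrings.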


2017-06-12 23:10:51

Thank you very much! This is very helpful. If you don't mind, could you explain what 'rlike(".*[0-9]+.*")' does in the code above? – ankursg8

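'rlike(".*[0-9]+.*")' uses a [regular expression](http://www.regular-expressions.info/) to check whether the value in the LASTNAME column contains at least one digit. '.*' means zero or more of any character, and '[0-9]+' means one or more digits between 0 and 9. –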

Got it. Thanks! I really appreciate it. – ankursg8
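
To see what the pattern from the comments accepts without starting Spark, here is a small plain-Scala sketch using java.util.regex. It relies on find(), which looks for the pattern anywhere in the input; to my knowledge rlike searches the same way, which would make the surrounding '.*' optional. The names DigitPatternDemo, containsDigit and cleanName are invented for this example, and None stands in for Spark's null.

```scala
import java.util.regex.Pattern

// Hypothetical demo object, for illustration only.
object DigitPatternDemo {
  // Same pattern as in the answer: any characters, one or more digits, any characters.
  private val digitPattern: Pattern = Pattern.compile(".*[0-9]+.*")

  // True when the input contains at least one digit.
  def containsDigit(s: String): Boolean = digitPattern.matcher(s).find()

  // None plays the role of Spark's null for names that contain a digit.
  def cleanName(s: String): Option[String] =
    if (containsDigit(s)) None else Some(s)

  def main(args: Array[String]): Unit = {
    // Prints one line per sample value, e.g. "Lara -> Some(Lara)".
    Seq("10 C", "9436", "Lara").foreach { s =>
      println(s"$s -> ${cleanName(s)}")
    }
  }
}
```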
