Spark - Remove special characters from DataFrame rows across different column types

Suppose I have a DataFrame with many columns, some of type string, others of type int, and others of type map.

For example, fields/columns of types: stringType|intType|mapType<string,int>|...

|-------------------------------------------------------------------------- 
| myString1  |myInt1| myMap1            |... 
|-------------------------------------------------------------------------- 
|"this_is_#string"| 123 |{"str11_in#map":1,"str21_in#map":2, "str31_in#map": 31}|... 
|"this_is_#string"| 456 |{"str12_in#map":1,"str22_in#map":2, "str32_in#map": 32}|... 
|"this_is_#string"| 789 |{"str13_in#map":1,"str23_in#map":2, "str33_in#map": 33}|... 
|--------------------------------------------------------------------------

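I want to remove certain characters, such as "_" and "#", from all columns of string and map type, so that the resulting DataFrame/RDD looks like this: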

|------------------------------------------------------------------------ 
|myString1  |myInt1|  myMap1|...         | 
|------------------------------------------------------------------------ 
|"thisisstring"| 123 |{"str11inmap":1,"str21inmap":2, "str31inmap": 31}|... 
|"thisisstring"| 456 |{"str12inmap":1,"str22inmap":2, "str32inmap": 32}|... 
|"thisisstring"| 789 |{"str13inmap":1,"str23inmap":2, "str33inmap": 33}|... 
|-------------------------------------------------------------------------

I'm not sure whether it is better to convert the DataFrame to an RDD and work with that, or to do the work directly on the DataFrame.

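I'm also not sure how best to handle the regular expression across the different column types (I'm using Scala). I would like to apply this to all columns of these two types (string and map), and avoid hard-coding column names as in: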

def cleanRows(mytabledata: DataFrame): RDD[String] = { 

//this will do the work for a specific column (myString1) of type string 
val oneColumn_clean = mytabledata.withColumn("myString1", regexp_replace(col("myString1"),"[_#]","")) 

     ... 
//return type can be RDD or Dataframe... 
}

Is there a simple way to do this? Thanks.

2017-03-16 Alg_D

One option is to define two UDFs, one to handle string-type columns and one to handle map-type columns:

import org.apache.spark.sql.functions.udf 
// toDF and $"..." need spark.implicits._ (already imported in spark-shell) 
val df = Seq(("this_is#string", 3, Map("str1_in#map" -> 3))).toDF("myString", "myInt", "myMap") 
df.show 
+--------------+-----+--------------------+ 
|  myString|myInt|    myMap| 
+--------------+-----+--------------------+ 
|this_is#string| 3|Map(str1_in#map -...| 
+--------------+-----+--------------------+

1) A UDF to handle string-type columns:

def remove_string: String => String = _.replaceAll("[_#]", "") 
def remove_string_udf = udf(remove_string)

2) A UDF to handle map-type columns:

def remove_map: Map[String, Int] => Map[String, Int] = _.map{ case (k, v) => k.replaceAll("[_#]", "") -> v } 
def remove_map_udf = udf(remove_map)

3) Apply the UDFs to the corresponding columns to clean them:

df.withColumn("myString", remove_string_udf($"myString")). 
    withColumn("myMap", remove_map_udf($"myMap")).show 

+------------+-----+-------------------+ 
| myString|myInt|    myMap| 
+------------+-----+-------------------+ 
|thisisstring| 3|Map(str1inmap -> 3)| 
+------------+-----+-------------------+
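If naming every column is what you want to avoid, one possible approach is to walk the DataFrame's schema and pick the cleaning expression from each column's data type. The following is only a sketch under some assumptions: the object `CleanColumns` and the helper `cleanAll` are hypothetical names, the map columns are assumed to be `Map[String, Int]` as in the example above, and column names are assumed not to contain dots or other characters that `col` would need escaped.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, regexp_replace, udf}
import org.apache.spark.sql.types.{IntegerType, MapType, StringType}

object CleanColumns {
  // Same rule as the UDFs above: drop "_" and "#".
  val Pattern = "[_#]"

  // Cleans the keys of a Map[String, Int] column; null maps are passed through.
  private val cleanMapKeys = udf((m: Map[String, Int]) =>
    if (m == null) null else m.map { case (k, v) => k.replaceAll(Pattern, "") -> v })

  // Hypothetical helper: chooses what to do with each column from its data type,
  // so no column name is hard-coded.
  def cleanAll(df: DataFrame): DataFrame =
    df.schema.fields.foldLeft(df) { (acc, field) =>
      field.dataType match {
        case StringType =>
          acc.withColumn(field.name, regexp_replace(col(field.name), Pattern, ""))
        case MapType(StringType, IntegerType, _) =>
          acc.withColumn(field.name, cleanMapKeys(col(field.name)))
        case _ =>
          acc // other types (e.g. int) are left untouched
      }
    }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for trying the sketch out.
    val spark = SparkSession.builder().master("local[*]").appName("clean").getOrCreate()
    import spark.implicits._
    val df = Seq(("this_is#string", 3, Map("str1_in#map" -> 3)))
      .toDF("myString", "myInt", "myMap")
    cleanAll(df).show(false)
    spark.stop()
  }
}
```

Maps with other value types would need their own case (and their own UDF) in the match. Note also that two keys which become identical after cleaning would collapse into a single map entry.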

2017-03-16 16:55:13 Psidom

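Hi @Psidom, thanks for the tip. This seems to work nicely, but this way I have to map every column in the DataFrame. I was looking for something more generic. Maybe it's not that simple... Upvoted anyway, thanks.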

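Hi @Psidom, is there by any chance a way to create a 'def remove_map: Map[String, T] => Map[String, T] = ...'? I tried several approaches, but none of them worked. This would avoid having multiple functions, one for each 'Map' combination.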

I'm not sure whether that is possible. It might be worth asking a separate question to see whether you can get any meaningful answers. – Psidom
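For what it's worth, one thing that may be worth trying for the generic map case is to keep the key-cleaning logic in a plain generic function and ask for a `TypeTag` on the value type when building the UDF, since `udf` needs type information for its argument and result. This is only a sketch: `cleanKeys` and `removeMapUdf` are hypothetical names, and it assumes the value type `T` is one that Spark can represent as a column type (e.g. `Int`, `String`, `Double`).

```scala
import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Pure Scala: strips "_" and "#" from the keys, whatever the value type is.
def cleanKeys[T](m: Map[String, T]): Map[String, T] =
  m.map { case (k, v) => k.replaceAll("[_#]", "") -> v }

// Builds a UDF for a given value type; the TypeTag context bound lets the
// compiler supply the type information that udf(...) asks for.
def removeMapUdf[T: TypeTag]: UserDefinedFunction =
  udf((m: Map[String, T]) => cleanKeys(m))
```

It would then be applied with the value type spelled out, for example `df.withColumn("myMap", removeMapUdf[Int].apply($"myMap"))`. The value type still has to be named at the call site, so this only removes the duplicated function bodies, not the need to know each column's type.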
