How to use an array of strings as a parameter in a Scala UDF?

My Spark DataFrame (created from a Hive table) looks like this:

+------+----------------------------------------------------------------------------------------+
|racist|filtered                                                                                |
+------+----------------------------------------------------------------------------------------+
|false |[rt, @dope_promo:, crew, beat, high, scores, fugly, frog, , https://time.com/sxp3onz1w8]|
|false |[rt, @axolrose:, yall, call, kermit, frog, lizard?, , https://time.com/wdaeaer1ay]      |
+------+----------------------------------------------------------------------------------------+

I am trying to remove the URLs from the filtered field.

I have tried:

val regex = "(https?\\://)\\S+".r 

def removeRegex(input: Array[String]) : Array[String] = { 
    regex.replaceAllIn(input, "") 
} 

val removeRegexUDF = udf(removeRegex) 

filteredDF.withColumn("noURL", removeRegexUDF('filtered)).show

This gives the following error:

<console>:60: error: overloaded method value replaceAllIn with alternatives: 
    (target: CharSequence,replacer: scala.util.matching.Regex.Match => String)String <and> 
    (target: CharSequence,replacement: String)String 
cannot be applied to (Array[String], String) 
      regex.replaceAllIn(input, "") 
       ^

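I am largely a newbie in Scala, so any guidance on how to handle the array in the filtered column inside a UDF would be much appreciated. (Or if there is a better way to do this, I would be happy to hear it.)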

2017-06-30 schoon

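Your input is an array of strings, but replaceAllIn takes a single string, in which every occurrence of the regex is replaced. – Secespitus

This is not really Spark-related; it is a pure Scala question. –

I would not replace the URLs with empty strings, but remove them from the array instead. This UDF will do the trick: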

val removeRegexUDF = udf(
    (input: Seq[String]) => input.filterNot(s => s.matches("(https?\\://)\\S+")) 
)
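
To see what the predicate inside that UDF does, here is a minimal plain-Scala sketch that runs without Spark. The object and method names (`FilterUrlsSketch`, `dropUrls`) are made up for illustration and are not part of the answer; the sample row is taken from the question.

```scala
// Plain-Scala sketch of the filtering logic; names are hypothetical.
object FilterUrlsSketch {
  // Same pattern as in the question. String.matches only succeeds when
  // the whole token matches, so tokens that merely contain a URL are kept.
  val urlPattern = "(https?\\://)\\S+"

  // Drops every token that is entirely a URL; all other tokens pass through.
  def dropUrls(tokens: Seq[String]): Seq[String] =
    tokens.filterNot(_.matches(urlPattern))

  def main(args: Array[String]): Unit = {
    val row = Seq("rt", "@axolrose:", "yall", "call", "kermit", "frog",
                  "lizard?", "", "https://time.com/wdaeaer1ay")
    // The URL token is gone; the empty token is untouched.
    println(dropUrls(row))
  }
}
```

If this behaves as wanted, the same function could presumably be wrapped with `udf(dropUrls _)` and applied to the filtered column just like removeRegexUDF above.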

2017-06-30 11:52:55

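Thanks, that did it! – schoon

Can I add an OR to the s.matches bit, so that a token is removed if it matches either pattern (a URL or something else)? – schoon

@schoon Sure, I would do it like this: 'filterNot(s => s.matches(regex1) || s.matches(regex2))' –

Yes, you can.

First, the parameter type should be Seq or WrappedArray instead of Array, because that is how Spark passes an array column to a Scala UDF. Second, replaceAllIn only turns one string into another string; it does not work on a collection.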
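
A minimal sketch of that two-pattern filter in plain Scala, runnable without Spark. The second pattern (`mentionPattern`, matching tokens that start with `@`) and the names `FilterMultipleSketch` and `dropMatches` are assumptions chosen only for illustration; the thread does not say what the second regex should be.

```scala
// Sketch of filtering with two patterns combined by ||; names are hypothetical.
object FilterMultipleSketch {
  // Pattern from the question.
  val urlPattern = "(https?\\://)\\S+"
  // Assumed second pattern for illustration: tokens that start with '@'.
  val mentionPattern = "@\\S+"

  // Removes a token when it fully matches either pattern.
  def dropMatches(tokens: Seq[String]): Seq[String] =
    tokens.filterNot(s => s.matches(urlPattern) || s.matches(mentionPattern))

  def main(args: Array[String]): Unit = {
    val row = Seq("rt", "@dope_promo:", "crew", "https://time.com/sxp3onz1w8")
    println(dropMatches(row))
  }
}
```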

Your UDF should be:

def removeRegex(input: Seq[String]) : Array[String] = { 
    input.map(x => regex.replaceAllIn(x, "")).toArray 
}

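So map applies the regex to each element of the sequence.

You can also use the regexp_replace function from Spark's functions (org.apache.spark.sql.functions).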
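
One consequence of the map approach is worth seeing on a small input: the array keeps its length, and each URL becomes an empty string rather than disappearing. The sketch below is plain Scala without Spark, and the names `ReplaceUrlsSketch` and `blankUrls` are hypothetical.

```scala
import scala.util.matching.Regex

// Sketch of the per-element replacement; names are hypothetical.
object ReplaceUrlsSketch {
  // Same regex as in the question.
  val regex: Regex = "(https?\\://)\\S+".r

  // Applies replaceAllIn to each token; the number of tokens is unchanged.
  def blankUrls(tokens: Seq[String]): Seq[String] =
    tokens.map(token => regex.replaceAllIn(token, ""))

  def main(args: Array[String]): Unit = {
    val row = Seq("frog", "lizard?", "https://time.com/wdaeaer1ay")
    // Still three elements; the last one is now the empty string.
    println(blankUrls(row).length)
  }
}
```

Whether an empty string or a removed element is preferable depends on what later processing expects from the array.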

2017-06-30 11:17:48

Thanks. That gave me this error: :61: error: type mismatch; found: Seq[String] required: Array[String] input.map(regex.replaceAllIn(_, "")) – schoon

@schoon Are you using a Scala regex? –

@schoon OK, now it should work :) –
