Apache Spark SQL DataFrame: filtering by multiple string rules

I am using a Spark DataFrame to search a column with 'like'. Conditions can be combined with the 'or' function, much like '||' in SQL, to filter like this:

voc_0201.filter(col("contents").like("intel").or(col("contents").like("apple"))).count

But I have to filter on many strings. How can I apply a list or array of strings as a filter on the DataFrame?

Thanks

2016-05-25 benchuang

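Let's first define our patterns:

import org.apache.spark.sql.functions.{broadcast, col, lit}

val patterns = Seq("foo", "bar")

and create an example DataFrame:

// toDF needs the session implicits in scope, e.g. import sqlContext.implicits._
// (spark-shell imports them automatically)
val df = Seq((1, "bar"), (2, "foo"), (3, "xyz")).toDF("id", "contents")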

A simple solution is to fold over the patterns:

val expr = patterns.foldLeft(lit(false))((acc, x) => 
    acc || col("contents").like(x) 
) 

df.where(expr).show 

// +---+--------+
// | id|contents|
// +---+--------+
// |  1|     bar|
// |  2|     foo|
// +---+--------+
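
To see what each 'like' term in the fold actually tests, here is a minimal sketch in plain Scala, without Spark. It assumes the standard SQL LIKE wildcards ('%' for any run of characters, '_' for exactly one) and ignores escape characters. The names likeToRegex and matchesAny are hypothetical helpers for illustration only, not part of the Spark API, and this is not how Spark implements 'like' internally.

```scala
import java.util.regex.Pattern

object LikeFoldSketch {
  // Hypothetical helper: translate the LIKE wildcards % and _ into a regex,
  // quoting every other character so it is matched literally.
  def likeToRegex(pattern: String): Pattern = {
    val body = pattern.map {
      case '%' => ".*"
      case '_' => "."
      case c   => Pattern.quote(c.toString)
    }.mkString
    Pattern.compile(body, Pattern.DOTALL)
  }

  // Same shape as the fold over patterns: start from false and OR in
  // one whole-value match per pattern.
  def matchesAny(patterns: Seq[String])(value: String): Boolean =
    patterns.foldLeft(false) { (acc, p) =>
      acc || likeToRegex(p).matcher(value).matches()
    }

  def main(args: Array[String]): Unit = {
    val rows = Seq("bar", "foo", "xyz", "foobar")
    // Patterns without wildcards must equal the whole value.
    println(rows.filter(matchesAny(Seq("foo", "bar"))))
    // Wrapping each term in % turns the test into a substring search.
    println(rows.filter(matchesAny(Seq("%foo%", "%bar%"))))
  }
}
```

In this sketch the first filter keeps only "bar" and "foo", while the second also keeps "foobar". If substring matching is what you need, the patterns passed to 'like' would have to carry the '%' wildcards themselves.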

Another is to build a regular expression and use rlike:

val expr = patterns.map(p => s"^$p$$").mkString("|") 
df.where(col("contents").rlike(expr)).show 

// +---+--------+
// | id|contents|
// +---+--------+
// |  1|     bar|
// |  2|     foo|
// +---+--------+

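PS: The solution above may not work if the patterns are not plain literals (for example, if they contain regex metacharacters).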
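
One way to make the regex approach safe for arbitrary literals is to quote each pattern before joining them. The sketch below is plain Scala and uses java.util.regex.Pattern.quote; the helper name exactMatchRegex is hypothetical. It assumes the resulting string is passed to rlike through the Column API, which takes a Java regular expression.

```scala
import java.util.regex.Pattern

object QuotedAlternation {
  // Hypothetical helper: anchor and quote each pattern so that characters
  // such as '.' or '+' are matched literally, then join with '|'.
  def exactMatchRegex(patterns: Seq[String]): String =
    patterns.map(p => "^" + Pattern.quote(p) + "$").mkString("|")

  def main(args: Array[String]): Unit = {
    val patterns = Seq("foo", "a.c", "1+1")
    val compiled = Pattern.compile(exactMatchRegex(patterns))

    // Without quoting, "a.c" would also match "abc" and "1+1" would match "11".
    Seq("foo", "a.c", "abc", "1+1", "11").foreach { s =>
      println(s"$s -> ${compiled.matcher(s).find()}")
    }
  }
}
```

With the DataFrame above, the result could then be used as df.where(col("contents").rlike(QuotedAlternation.exactMatchRegex(patterns))).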

Finally, for simple patterns you can use isin:

df.where(col("contents").isin(patterns: _*)).show 

// +---+--------+
// | id|contents|
// +---+--------+
// |  1|     bar|
// |  2|     foo|
// +---+--------+

It is also possible to use a join:

val patternsDF = patterns.map(Tuple1(_)).toDF("contents") 
df.join(broadcast(patternsDF), Seq("contents")).show 

// +---+--------+
// | id|contents|
// +---+--------+
// |  1|     bar|
// |  2|     foo|
// +---+--------+

2016-05-25 08:01:52 zero323

Thanks. The first solution works. The second and third return empty results. – benchuang

Thanks, the first solution meets my requirements. It works well. – benchuang
