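SparkSQL: split using a regular expression

I want to split a line into an array using a regular expression. My line contains an Apache log entry, and I would like to do the split in SQL.

I tried the split and array functions, but neither worked.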

 
Select split('10.10.10.10 - - [08/Sep/2015:00:00:03 +0000] "GET /index.html HTTP/1.1" 206 - - "Apache-HttpClient" -', '^([^ ]+) ([^ ]+) ([^ ]+) \[([^\]]+)\] "([^"]+)" \d+ - - "([^"]+)".*') 
;

I expect an array of 6 elements.

Thanks

Source

2015-09-15 Younes

I'd suggest creating a Hive table as described at http://www.dowdandassociates.com/blog/content/howto-use-hive-with-apache-logs/ and then copying the parsed data into a Parquet table.

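As you might guess, the split function splits a string on a pattern. Since the pattern you provided matches the whole input, there is nothing left to return, hence an empty array.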
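To make that concrete, here is a minimal sketch in plain Scala, without Spark. It uses the JVM's String.split as a stand-in for the SQL split function, which is an assumption: the exact contents of the array Spark returns may differ, but in both cases the pattern is treated as a delimiter, so a pattern that consumes the whole line leaves no useful pieces.

```scala
// The log line and pattern from the question.
val line = """10.10.10.10 - - [08/Sep/2015:00:00:03 +0000] "GET /index.html HTTP/1.1" 206 - - "Apache-HttpClient" -"""
val pattern = """^([^ ]+) ([^ ]+) ([^ ]+) \[([^\]]+)\] "([^"]+)" \d+ - - "([^"]+)".*"""

// split treats the pattern as a delimiter. Here the "delimiter" is the whole
// line, so the pieces on either side of it are empty strings, and
// String.split drops trailing empty strings.
val whole = line.split(pattern)
println(whole.length)

// For contrast, a pattern that matches only the separator yields real tokens.
val bySpace = line.split(" ")
println(bySpace.head)
```

Capture groups play no role in split, which is why the six groups in the pattern never show up in the result.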

import org.apache.spark.sql.functions.{regexp_extract, array} 

val pattern = """^([^ ]+) ([^ ]+) ([^ ]+) \[([^\]]+)\] "([^"]+)" \d+ - - "([^"]+)".*""" 

val df = sc.parallelize(Seq((
    1L, """10.10.10.10 - - [08/Sep/2015:00:00:03 +0000] "GET /index.html HTTP/1.1" 206 - - "Apache-HttpClient" -""" 
))).toDF("id", "log")

What you need is regexp_extract:

val exprs = (1 to 6).map(i => regexp_extract($"log", pattern, i).alias(s"_$i")) 

df.select(exprs:_*).show 
// +-----------+---+---+--------------------+--------------------+-----------------+
// |         _1| _2| _3|                  _4|                  _5|               _6|
// +-----------+---+---+--------------------+--------------------+-----------------+
// |10.10.10.10|  -|  -|08/Sep/2015:00:00...|GET /index.html H...|Apache-HttpClient|
// +-----------+---+---+--------------------+--------------------+-----------------+
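Since the question asks for a single array of 6 elements rather than six columns, the extracted groups can be wrapped in array. The following is a sketch under a few assumptions: it uses SparkSession (Spark 2.0 or later) instead of the sc shown here, runs in local mode, and the application name and the column name fields are made up for illustration. It is meant to be pasted into spark-shell or run as a script.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, regexp_extract}

// Assumption: a local session; in spark-shell getOrCreate reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("log-split-sketch").getOrCreate()
import spark.implicits._

val pattern = """^([^ ]+) ([^ ]+) ([^ ]+) \[([^\]]+)\] "([^"]+)" \d+ - - "([^"]+)".*"""

val df = Seq((
  1L, """10.10.10.10 - - [08/Sep/2015:00:00:03 +0000] "GET /index.html HTTP/1.1" 206 - - "Apache-HttpClient" -"""
)).toDF("id", "log")

// One regexp_extract per capture group, then combined into one array column.
val groups = (1 to 6).map(i => regexp_extract(col("log"), pattern, i))
val asArray = df.select(array(groups: _*).alias("fields"))

// Pull the array for the single row back to the driver to inspect it.
val fields = asArray.first().getSeq[String](0)
println(fields.mkString(" | "))
```

Each element of fields corresponds to one capture group of the pattern, in order.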

or, for example, a UDF:

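import org.apache.spark.sql.functions.udf
import scala.util.matching.Regex

val extractFromLog = udf({
  // Compile the pattern once, outside the per-row function
  val logLine = new Regex(pattern)
  (s: String) => s match {
    // Let's ignore some fields for simplicity
    case logLine(ip, _, _, ts, request, client) =>
      Some(Array(ip, ts, request, client))
    case _ => None
  }
})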

df.select(extractFromLog($"log"))
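The matching logic inside the UDF can be checked in plain Scala before wrapping it with udf. The sketch below is illustrative only; the names logLine and extractFields are hypothetical and not part of the answer above.

```scala
import scala.util.matching.Regex

val pattern = """^([^ ]+) ([^ ]+) ([^ ]+) \[([^\]]+)\] "([^"]+)" \d+ - - "([^"]+)".*"""
val logLine: Regex = new Regex(pattern)

// Hypothetical helper mirroring the body of the UDF: a Regex used as an
// extractor only succeeds when the whole string matches the pattern.
def extractFields(s: String): Option[Array[String]] = s match {
  case logLine(ip, _, _, ts, request, client) => Some(Array(ip, ts, request, client))
  case _ => None
}

val sample = """10.10.10.10 - - [08/Sep/2015:00:00:03 +0000] "GET /index.html HTTP/1.1" 206 - - "Apache-HttpClient" -"""

// A well-formed line yields the four kept fields; anything else yields None.
println(extractFields(sample).map(_.mkString(" | ")))
println(extractFields("not a log line"))
```

Returning an Option means a line that does not match the pattern becomes a null in the resulting column instead of raising an exception.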

Source

2016-02-18 18:45:06 zero323
