2016-03-01 25 views
0

我是新手,我需要此問題的幫助。拆分字段並從一行創建多行Spark-Scala

我有一個CSV文件是這樣的:

ANI,2974483123 29744423747 293744450542,Twitter,@ani 

我需要拆分第二列 「2974483123 29744423747 293744450542」,並創建3行是這樣的:

ANI,2974483123,Twitter,@ani 

ANI,29744423747,Twitter,@ani 

ANI,293744450542,Twitter,@ani 

有人能幫助我嗎?請!

回答

7

flatMap是你在找什麼:

val input: RDD[String] = sc.parallelize(Seq("ANI,2974483123 29744423747 293744450542,Twitter,@ani")) 
val csv: RDD[Array[String]] = input.map(_.split(',')) 

val result = csv.flatMap { case Array(s1, s2, s3, s4) => s2.split(" ").map(part => (s1, part, s3, s4)) } 
+0

只專注於三個國家:美國,加拿大,MX .....原始記錄: [ 「MotelID」, 「BidDate」, 「胡」, 「英國」, 「NL」, 「美」 ,MX,AU,CA,CN,KR,BE,I,JP,IN,HN,GY, [0000002,11-05-08-2016,0.92,1.68,0.81,0.68,1.59,1.63,1.77,2.06,0.66,1.53,0.32,0.88,0.83,1.01] 只保留三個重要的 0000002 ,11-05-08-2016,1.59,,1.77 轉置記錄並將相關Losa包含在單獨的列中 0000002,11-05-08-2016,US,1.59 0000002,11-05-08-2016 ,MX, 0000002,11-05-08-2016,CA,1.77 ....如何獲得以上結果? – user3252097

0

非常相似Tzach的答案,但在python2和並注意多空格分隔。

import re 

rdd = sc.textFile("datasets/test.csv").map(lambda x: x.split(",")) 

print(rdd.take(1)) 
print(rdd.map(lambda (a, b, c, d): [(a, number, c, d) for number in re.split(" +", b)]) 
     .flatMap(lambda x: x) 
     .take(10)) 

#[[u'ANI', u'2974481249 2974444747 2974440542', u'Twitter', u'maximotussie']] 
#[(u'ANI', u'2974481249', u'Twitter', u'maximotussie'), 
# (u'ANI', u'2974444747', u'Twitter', u'maximotussie'), 
# (u'ANI', u'2974440542', u'Twitter', u'maximotussie')] 
1

這是一個略有不同的解決方案,它利用Spark提供的內置SQL UDF。理想情況下,應該使用這些代替自定義函數來利用查詢優化器(https://blog.cloudera.com/blog/2017/02/working-with-udfs-in-apache-spark/)提供的性能改進。

import org.apache.spark.sql.functions.{split, explode} 

val filename = "/path/to/file.csv" 
val columns = Seq("col1","col2","col3","col4") 

val df = spark.read.csv(filename).toDF(columns: _*) 

// import "split" instead of writing your own split UDF 
df.withColumn("col2", split($"col2", " ")). 
    // import "explode instead of map then flatMap 
    select($"col1", explode($"col2"), $"col3", $"col4").take(10)