2017-06-19 69 views

Spark - read a CSV file without newline characters

I am parsing a CSV file that has no newline characters:

"line1field1", "line1field2", "line1field3", "line2field1", "line2field2", "line2field3", "line3field1", "line3field2", "line3field3" 

Is it possible to do this efficiently in Spark? (I want to end up with a Dataset of 3 rows with 3 fields each.)

Answers

1

If I understand your question correctly, you have input data with no line separator, such as

"line1field1", "line1field2", "line1field3", "line2field1", "line2field2", "line2field3", "line3field1", "line3field2", "line3field3" 

and you want the output to be

+-------------+-------------+-------------+
|Column1      |Column2      |Column3      |
+-------------+-------------+-------------+ 
|"line1field1"|"line1field2"|"line1field3"| 
|"line2field1"|"line2field2"|"line2field3"| 
|"line3field1"|"line3field2"|"line3field3"| 
+-------------+-------------+-------------+ 

The following code should help you achieve this:

import java.util
import scala.util.Try
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val data = sc.textFile("path to the input file")
val todf = data
    .map(line => line.split(",")).map(array => {
     // collect every 3 consecutive fields into one row
     val list = new util.ArrayList[Array[String]]()
     for(index <- 0 to array.length-1 by 3){
     // Try(...) getOrElse "" pads a short final group with empty strings
     list.add(Array(Try(array(index)) getOrElse "", Try(array(index+1)) getOrElse "", Try(array(index+2)) getOrElse ""))
     }
     list
    })
    .flatMap(a => a.toArray())
    .map(arr => arr.asInstanceOf[Array[String]])
    .map(row => Row.fromSeq(Seq(row(0).trim, row(1).trim, row(2).trim))) // trim the spaces around each field

val schema = StructType(Array(StructField("Column1", StringType, true), StructField("Column2", StringType, true), StructField("Column3", StringType, true)))
sqlContext.createDataFrame(todf, schema).show(false)
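
The grouping step above can be tried without a cluster. The following is only a minimal sketch in plain Scala, assuming the same idea of cutting the field list into fixed-width rows; `ChunkFields` and `toRows` are hypothetical names that do not appear in the answer, and `grouped` with `padTo` stands in for the explicit loop and the `Try(...) getOrElse ""` padding.

```scala
object ChunkFields {
  // Hypothetical helper (not part of the answer): split one comma-separated
  // line into rows of `width` fields each.
  def toRows(line: String, width: Int): List[List[String]] =
    line.split(",")
      .map(_.trim)                    // drop the spaces around each field
      .grouped(width)                 // non-overlapping groups of `width` fields
      .map(_.toList.padTo(width, "")) // pad a short final group with ""
      .toList

  def main(args: Array[String]): Unit = {
    // Made-up sample input in the same shape as the question's data.
    val line = "\"line1field1\", \"line1field2\", \"line1field3\", \"line2field1\""
    toRows(line, 3).foreach(row => println(row.mkString("|")))
  }
}
```

The same function body could be passed to `flatMap` on the RDD of lines, since it depends only on the line and the row width.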

I hope this answer is helpful.

1

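If you want a Spark-based approach, this should work. I loaded the CSV file from the resources folder, but you can put in the path where your own file is located.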

import sqlContext.implicits._

val columnNames: Seq[String] = Seq("Col1", "Col2", "Col3")

sparkContext.textFile(this.getClass.getResource("/test.csv").toString) // your file location here
    .flatMap(x => x.split(',').sliding(3, 3)) // non-overlapping groups of 3 fields
    .map(x => x.toList)
    .map { case List(a, b, c) => (a, b, c) } // convert to a tuple; throws a MatchError if a group has fewer than 3 fields
    .toDF(columnNames: _*)
    .show(truncate = false)

This produces:

+-----------+-----------+-----------+ 
|Col1       |Col2       |Col3       |
+-----------+-----------+-----------+ 
|line1field1|line1field2|line1field3| 
|line2field1|line2field2|line2field3| 
|line3field1|line3field2|line3field3| 
+-----------+-----------+-----------+ 

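Changing the sliding window size to match your number of columns makes this work for other row widths. You will also need to change the tuple mapping, so this may not work well for a large number of columns.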

You may want to look at the List to Tuple answer to see how to map a list of unknown size to a tuple.
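
One way to sidestep the tuple mapping for wider rows is to build `Row` objects and an explicit schema, much as the first answer does. The following is only a sketch under assumptions: the `WideCsv` object, the `toDataFrame` helper, the column names, and the sample data are all hypothetical, and it assumes a Spark version that provides `SparkSession`.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object WideCsv {
  // Hypothetical helper: the row width follows from the number of column names,
  // so no tuple pattern has to be rewritten when the width changes.
  def toDataFrame(spark: SparkSession, lines: RDD[String], columnNames: Seq[String]): DataFrame = {
    val width = columnNames.size
    val schema = StructType(columnNames.map(name => StructField(name, StringType, nullable = true)))
    val rows = lines
      .flatMap(_.split(',').grouped(width))                    // one group per output row
      .map(group => Row.fromSeq(group.toSeq.padTo(width, ""))) // pad a short final group
    spark.createDataFrame(rows, schema)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used here for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("wide-csv").getOrCreate()
    // Stand-in for textFile(...): a single record holding every field.
    val lines = spark.sparkContext.parallelize(Seq("a1,a2,a3,a4,b1,b2,b3,b4"))
    toDataFrame(spark, lines, Seq("Col1", "Col2", "Col3", "Col4")).show(truncate = false)
    spark.stop()
  }
}
```

Every column is typed as a string here; any per-column typing would have to be added to the schema separately.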