2015-09-14 74 views

How do I merge two text files and convert them to a CSV file in Scala?

I export a DataFrame with the following code:

df.select("A", "b", "C", "D","E") 
    .write.format("com.databricks.spark.csv") 
    .save("newiris.csv") 

I get two text files, as follows:

part-00000

5.1,3.5,1.4,0.2,Iris-setosa 
4.9,3,1.4,0.2,Iris-setosa 
4.7,3.2,1.3,0.2,Iris-setosa 
4.6,3.1,1.5,0.2,Iris-setosa 
5,3.6,1.4,0.2,Iris-setosa 
5.4,3.9,1.7,0.4,Iris-setosa 

part-00001

6.7,3,5,1.7,Iris-versicolor 
6,2.9,4.5,1.5,Iris-versicolor 
5.7,2.6,3.5,1,Iris-versicolor 
5.5,2.4,3.8,1.1,Iris-versicolor 
5.5,2.4,3.7,1,Iris-versicolor 
5.8,2.7,3.9,1.2,Iris-versicolor 

Now I would like to combine them into a single file, like this:

5.1,3.5,1.4,0.2,Iris-setosa 
4.9,3,1.4,0.2,Iris-setosa 
4.7,3.2,1.3,0.2,Iris-setosa 
4.6,3.1,1.5,0.2,Iris-setosa 
5,3.6,1.4,0.2,Iris-setosa 
5.4,3.9,1.7,0.4,Iris-setosa 
6.7,3,5,1.7,Iris-versicolor 
6,2.9,4.5,1.5,Iris-versicolor 
5.7,2.6,3.5,1,Iris-versicolor 
5.5,2.4,3.8,1.1,Iris-versicolor 
5.5,2.4,3.7,1,Iris-versicolor 
5.8,2.7,3.9,1.2,Iris-versicolor 

and then convert it to CSV. How can I do this in Scala?

Answers

1

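The Scala pieces needed here are scala.io.Source to read each file and get its lines, ++ to append the lines of part-00001 to those of part-00000, and a foreach loop to go through the combined data and write it to a file. The file I/O itself is the same as in Java.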

scala> import java.io._ 

scala> import scala.io.Source 

scala> val part0 = Source.fromFile("part-00000.txt").getLines 
part0: Iterator[String] = non-empty iterator 

scala> val part1 = Source.fromFile("part-00001.txt").getLines 
part1: Iterator[String] = non-empty iterator 

scala> val part2 = part0.toList ++ part1.toList 
part2: List[String] = List(5.1,3.5,1.4,0.2,Iris-setosa, 4.9,3,1.4,0.2,Iris-setosa, 4.7,3.2,1.3,0.2,Iris-setosa, 4.6,3.1,1.5,0.2,Iris-setosa, 5,3.6,1.4,0.2,Iris-setosa, 5.4,3.9,1.7,0.4,Iris-setosa, 6.7,3,5,1.7,Iris-versicolor, 6,2.9,4.5,1.5,Iris-versicolor, 5.7,2.6,3.5,1,Iris-versicolor, 5.5,2.4,3.8,1.1,Iris-versicolor, 5.5,2.4,3.7,1,Iris-versicolor, 5.8,2.7,3.9,1.2,Iris-versicolor) 

scala> val part00002 = new File("part-00002") 
part00002: java.io.File = part-00002 

scala> val bw = new BufferedWriter(new FileWriter(part00002)) 
bw: java.io.BufferedWriter = java.io.BufferedWriter@...

scala> part2.foreach(p => bw.write(p + "\n")) 


scala> bw.close 

Check the file:

brian:/tmp/ $ cat part-00002                
5.1,3.5,1.4,0.2,Iris-setosa 
4.9,3,1.4,0.2,Iris-setosa 
4.7,3.2,1.3,0.2,Iris-setosa 
4.6,3.1,1.5,0.2,Iris-setosa 
5,3.6,1.4,0.2,Iris-setosa 
5.4,3.9,1.7,0.4,Iris-setosa 
6.7,3,5,1.7,Iris-versicolor 
6,2.9,4.5,1.5,Iris-versicolor 
5.7,2.6,3.5,1,Iris-versicolor 
5.5,2.4,3.8,1.1,Iris-versicolor 
5.5,2.4,3.7,1,Iris-versicolor 
5.8,2.7,3.9,1.2,Iris-versicolor 
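
Outside the REPL, the same steps can be collected into one small function. The following is only a minimal sketch of that idea, not the code from the session above: the names `MergeParts` and `mergeFiles` and the output name in the usage line are made up for illustration. It streams each input line by line instead of building a List first, and closes every file even if a read fails.

```scala
import java.io.{BufferedWriter, File, FileWriter}
import scala.io.Source

object MergeParts {
  // Hypothetical helper: append the lines of each input file, in the
  // order given, to a single output file.
  def mergeFiles(inputs: Seq[String], output: String): Unit = {
    val writer = new BufferedWriter(new FileWriter(new File(output)))
    try {
      inputs.foreach { path =>
        val source = Source.fromFile(path)
        try {
          // getLines strips line terminators, so write one back per line
          source.getLines().foreach { line =>
            writer.write(line)
            writer.write("\n")
          }
        } finally source.close() // release the input file handle
      }
    } finally writer.close() // flush and close the output in every case
  }
}
```

With the file names used above, a call would look like `MergeParts.mergeFiles(Seq("part-00000.txt", "part-00001.txt"), "newiris-merged.csv")`.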

Thank you very much! When I run val part00002 = new File("part-00002") I get the error "not found: type File". Do I need to define File, or import something? – Tong

'import java.io._' should do it. – Brian

Thanks! It works perfectly. One more question: if part-00000 and part-00001 were in CSV format, would this operation be any easier? – Tong
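
On the follow-up question: the part files shown in the question already hold comma-separated values, so no per-line conversion should be needed. Giving the merged file a .csv name, and a header row if one is wanted, is probably enough. Below is a rough sketch under a few assumptions: the part files sit in one directory, their names start with `part-`, and sorting by name gives the wanted order. `PartsToCsv`, `mergeParts`, and the header text (borrowed from the columns in the `select` call) are illustrative names, not something Spark provides.

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

object PartsToCsv {
  // Hypothetical helper: concatenate every part-* file in `dir`, sorted by
  // name, into `output`, writing `header` first when one is supplied.
  def mergeParts(dir: String, output: String, header: Option[String] = None): Unit = {
    // listFiles returns null when `dir` is not a directory
    val listed = Option(new File(dir).listFiles()).getOrElse(Array.empty[File])
    // keep only the data files; markers such as _SUCCESS are skipped
    val parts = listed
      .filter(f => f.isFile && f.getName.startsWith("part-"))
      .sortBy(_.getName)
    val out = new PrintWriter(new File(output))
    try {
      header.foreach(h => out.println(h))
      parts.foreach { part =>
        val src = Source.fromFile(part)
        try src.getLines().foreach(line => out.println(line))
        finally src.close()
      }
    } finally out.close()
  }
}
```

Assuming `newiris.csv` is the output directory created by `save`, a call might look like `PartsToCsv.mergeParts("newiris.csv", "newiris-merged.csv", Some("A,b,C,D,E"))`. Keep the output name from starting with `part-` if it is written into the same directory, or a second run will read it back in as input.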