2016-12-30
4

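How do I remove the parentheses around records when calling saveAsTextFile on an RDD[(String, Int)]? Specifically, how do I remove the "(" and ")" from the output of the Spark job below?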

The parentheses cause a problem when I try to read the Spark output with a Pig script.

My code:

scala> val words = Array("HI","HOW","ARE") 
words: Array[String] = Array(HI, HOW, ARE) 

scala> val wordsRDD = sc.parallelize(words) 
wordsRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:23 

scala> val keyvalueRDD = wordsRDD.map(elem => (elem,1)) 
keyvalueRDD: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[1] at map at <console>:25 

scala> val wordcountRDD = keyvalueRDD.reduceByKey((x,y) => x+y) 
wordcountRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[2] at reduceByKey at <console>:27 

scala> wordcountRDD.saveAsTextFile("/user/cloudera/outputfiles") 

Output of the code above:

hadoop dfs -cat /user/cloudera/outputfiles/part* 

(HOW,1) 
(ARE,1) 
(HI,1) 

But I want the Spark output to be stored without parentheses, like this:

HOW,1 
ARE,1 
HI,1 

Now I want to read the above output with a Pig script. Pig treats "(HOW" as the first atom and "1)" as the second atom.

Is there any way to get rid of the parentheses in the Spark code itself? I don't want to apply the fix in the Pig script's LOAD statement.

Pig script:

records = LOAD '/user/cloudera/outputfiles' USING PigStorage(',') AS (word:chararray); 
dump records; 

Pig output:

((HOW) 
((ARE) 
((HI) 

Answers

2

Use map to transform your records into strings before saving them to the outputfiles directory, e.g.:

wordcountRDD.map { case (k, v) => s"$k,$v" }.saveAsTextFile("/user/cloudera/outputfiles")

See Spark's documentation about map.
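The function passed to map is ordinary Scala, so it can be tried on a local collection before running the job. The following is a minimal sketch, not part of the original answer: the object name RecordFormat and the helper formatRecord are hypothetical, and the local Seq merely stands in for the contents of wordcountRDD.

```scala
object RecordFormat {
  // Hypothetical helper: renders one (word, count) pair as a comma-separated line.
  // No space follows the comma, so a ','-delimited reader sees exactly "HOW" and "1".
  def formatRecord(record: (String, Int)): String = record match {
    case (word, count) => s"$word,$count"
  }

  def main(args: Array[String]): Unit = {
    // Local stand-in for the (word, count) pairs held by wordcountRDD.
    val counts = Seq(("HOW", 1), ("ARE", 1), ("HI", 1))
    counts.map(formatRecord).foreach(println)
  }
}
```

In the Spark job, the same helper could presumably be passed to wordcountRDD.map before calling saveAsTextFile.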


I strongly recommend using Datasets instead.

scala> words.toSeq.toDS.groupBy("value").count().show 
+-----+-----+ 
|value|count| 
+-----+-----+ 
|  HOW|    1|
|  ARE|    1|
|   HI|    1|
+-----+-----+ 

scala> words.toSeq.toDS.groupBy("value").count.write.csv("outputfiles") 

$ cat outputfiles/part-00199-aa752576-2f65-481b-b4dd-813262abb6c2-c000.csv 
HI,1 

See the Spark SQL, DataFrames and Datasets Guide.

1

That format is how a tuple is rendered as a string. You can define the format manually:

val wordcountRDD = keyvalueRDD.reduceByKey((x,y) => x+y)
  // here we set custom format
  .map(x => x._1 + "," + x._2)
wordcountRDD.saveAsTextFile("/user/cloudera/outputfiles")
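The parentheses appear because saveAsTextFile writes each element using its toString, and a Scala tuple's toString wraps the fields in parentheses. The sketch below shows this in plain Scala, with no Spark needed, along with an alternative that joins the fields of a tuple of any size. The names TupleLines and toDelimited are hypothetical and do not come from the answers above.

```scala
object TupleLines {
  // Hypothetical helper: joins all fields of a tuple (or any Product) with a delimiter,
  // so it does not depend on how many fields the tuple has.
  def toDelimited(record: Product, delimiter: String = ","): String =
    record.productIterator.mkString(delimiter)

  def main(args: Array[String]): Unit = {
    val record = ("HOW", 1)
    println(record.toString)      // default tuple rendering, with parentheses
    println(toDelimited(record))  // fields joined by commas, no parentheses
  }
}
```

A mapping such as wordcountRDD.map(t => TupleLines.toDelimited(t)) would then be one way to produce the same lines as the manual concatenation above.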