I have a delimited file like this that I am reading into an array in Spark:

2:-31:20063:28:0:1496745908:3879:0:0:0:0:6:4:3 
2:-41:20063:28:0:1496745909:3879:0:0:0:0:6:4:3 
2:-35:20063:28:0:1496745910:3879:0:0:0:0:6:4:3 
2:-44:20063:28:0:1496745911:3879:0:0:0:0:6:4:3 
2:-41:20063:28:0:1496745912:3879:0:0:0:0:6:4:3 
2:-51:20063:28:0:1496745913:3879:0:0:0:0:6:4:3 
2:-52:20063:28:0:1496745914:3879:0:0:0:0:6:4:3 
2:-61:20063:28:0:1496745915:3879:0:0:0:0:6:4:3 

I want to read this file and store it in an array, so that I can access each column for aggregation. I tried this:

def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("Proximity Filter").setMaster("local[2]").set("spark.executor.memory", "1g")
  val sc = new SparkContext(conf)
  val input = sc.textFile("/home/arun/Desktop/part-r-00000")
  val wordCount = input.flatMap(line => line.split(":"))
  val input1 = wordCount.take(0)
  System.out.print(input1)
}

So do you get any error with the RDD? What exactly is the problem? – philantrovert


Is there a particular reason you are using RDDs? I would imagine a better solution is to use DataFrame or Dataset semantics, which would let you use the csv DataFrameReader. –
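A minimal sketch of that DataFrame approach, assuming Spark 2.x with a SparkSession (the path and delimiter come from the question; everything else here is illustrative, not the asker's code):

import org.apache.spark.sql.SparkSession

// Build a local SparkSession; settings mirror the question's SparkConf.
val spark = SparkSession.builder()
  .appName("Proximity Filter")
  .master("local[2]")
  .getOrCreate()

// Read the ":"-delimited file with the csv DataFrameReader.
// With no header, columns get default names (_c0, _c1, ...).
val df = spark.read
  .option("delimiter", ":")
  .option("inferSchema", "true")
  .csv("/home/arun/Desktop/part-r-00000")

df.show()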

Answer


Change flatMap to map and you should be fine:

val wordCount = input.map(line => line.split(":")) 
wordCount.foreach(array => println(array(0), array(1), array(2), array(3), array(4), array(5), array(6), array(7), array(8), array(9), array(10), array(11), array(12), array(13))) 

You should get this output:

(2,-31,20063,28,0,1496745908,3879,0,0,0,0,6,4,3) 
(2,-41,20063,28,0,1496745909,3879,0,0,0,0,6,4,3) 
(2,-35,20063,28,0,1496745910,3879,0,0,0,0,6,4,3) 
(2,-44,20063,28,0,1496745911,3879,0,0,0,0,6,4,3) 
(2,-41,20063,28,0,1496745912,3879,0,0,0,0,6,4,3) 
(2,-51,20063,28,0,1496745913,3879,0,0,0,0,6,4,3) 
(2,-52,20063,28,0,1496745914,3879,0,0,0,0,6,4,3) 
(2,-61,20063,28,0,1496745915,3879,0,0,0,0,6,4,3) 

Changing your flatMap to map produces an RDD of arrays of strings:

scala> val wordCount = input.map(line => line.split(":")) 
wordCount: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:26 

whereas using flatMap gives you an RDD of individual strings:

scala> val wordCount = input.flatMap(line => line.split(":")) 
wordCount: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at flatMap at <console>:26 
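
For the aggregation mentioned in the question, a hypothetical sketch on top of that RDD[Array[String]], averaging the second column (index 1) after casting it to Double:

// Pick out one column and convert it to a numeric type.
val secondCol = wordCount.map(array => array(1).toDouble)
// sum() comes from DoubleRDDFunctions, available on any RDD[Double].
val average = secondCol.sum() / secondCol.count()
println(s"Average of column 1: $average")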

Can you tell me the same code logic in Java, using DataFrameReader? – Rakshita
