Scala foreach scoping issue

My goal is to process a series of folders of SequenceFiles generated by calls to org.apache.spark.rdd.RDD[_].saveAsObjectFile(...). My folder structure is similar to this:
\MyRootDirectory
    \Batch0001
        _SUCCESS
        part-00000
        part-00001
        ...
        part-nnnnn
    \Batch0002
        _SUCCESS
        part-00000
        part-00001
        ...
        part-nnnnn
    ...
    \Batchnnnn
        _SUCCESS
        part-00000
        part-00001
        ...
        part-nnnnn
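(For reference, each batch directory is the output of one saveAsObjectFile call. Below is a minimal, hypothetical writer sketch; saveAsObjectFile stores the RDD as a SequenceFile of (NullWritable, BytesWritable) pairs of serialized objects, which is why the reader further down passes exactly those key/value classes.)

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch: one saveAsObjectFile call per batch directory.
// Each call produces _SUCCESS plus part-nnnnn SequenceFiles containing
// (NullWritable, BytesWritable) pairs of serialized records.
val sc = new SparkContext(new SparkConf().setAppName("writer").setMaster("local[1]"))
sc.parallelize(Seq("record1", "record2", "record3"))
  .saveAsObjectFile("C:/MyRootDirectory/Batch0001")
sc.stop()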
I need to extract some of the persisted data, but my collection (whether I use a ListBuffer, a mutable.Map, or any other mutable type) loses scope and seems to be newed up on each iteration of sequenceFile(...).foreach. The proof of concept below produces a series of "Processing directory..." lines, each followed by repeated "1 : 1" output that never increments, as I expected counter and intList.size to do.
import java.io.File

import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable.ListBuffer

private def proofOfConcept(rootDirectoryName: String) = {
  val intList = ListBuffer[Int]()   // driver-local collection meant to accumulate
  var counter: Int = 0              // driver-local counter meant to accumulate
  val config = new SparkConf().setAppName("local").setMaster("local[1]")
  new File(rootDirectoryName).listFiles().map(_.toString).foreach { folderName =>
    println(s"Processing directory $folderName...")
    val sc = new SparkContext(config)
    sc.setLogLevel("WARN")
    sc.sequenceFile(folderName, classOf[NullWritable], classOf[BytesWritable]).foreach(f => {
      counter += 1
      intList += counter
      println(s" $counter : ${intList.size}")
    })
    sc.stop()
  }
}
Output:
"C:\Program Files\Java\jdk1.8.0_111\bin\java" ...
Processing directory C:\MyRootDirectory\Batch0001...
17/05/24 09:30:25.228 WARN [main] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[Stage 0:> (0 + 0)/57] 1 : 1
1 : 1
1 : 1
1 : 1
1 : 1
1 : 1
1 : 1
1 : 1
Processing directory C:\MyRootDirectory\Batch0002...
1 : 1
1 : 1
1 : 1
1 : 1
1 : 1
1 : 1
1 : 1
1 : 1
Processing directory C:\MyRootDirectory\Batch0003...
1 : 1
1 : 1
1 : 1
1 : 1
1 : 1
1 : 1
1 : 1
1 : 1
https://spark.apache.org/docs/latest/programming-guide.html#understanding-closures – zero323
Do you see this when Spark is not in the picture? –
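As the closures section linked above explains, the function passed to RDD.foreach is shipped to the executors with its own copy of the captured variables, so mutations to counter and intList are never reflected in the driver's originals. A minimal driver-side sketch (illustrative, not the original poster's code) that collects each folder's records back to the driver before mutating local state:

import java.io.File

import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.mutable.ListBuffer

// Illustrative sketch: bring the data to the driver with collect(), then
// mutate the driver-local collection. A single SparkContext is reused
// instead of creating one per folder.
private def driverSideCount(rootDirectoryName: String): Unit = {
  val intList = ListBuffer[Int]()
  var counter: Int = 0
  val sc = new SparkContext(new SparkConf().setAppName("local").setMaster("local[1]"))
  sc.setLogLevel("WARN")
  new File(rootDirectoryName).listFiles().map(_.toString).foreach { folderName =>
    println(s"Processing directory $folderName...")
    sc.sequenceFile(folderName, classOf[NullWritable], classOf[BytesWritable])
      .collect()   // Array[(NullWritable, BytesWritable)], now on the driver
      .foreach { _ =>
        counter += 1
        intList += counter
        println(s" $counter : ${intList.size}")
      }
  }
  sc.stop()
}

If only the count is needed, a Spark 2.x accumulator (sc.longAccumulator("counter")) would also accumulate correctly across executors without pulling the data back to the driver.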