
Answers


No, it won't be unpersisted automatically.

Why? Because although it may look to you like the RDD is no longer needed, Spark's model is not to materialize an RDD until it is needed for a transformation, so it is actually very hard to say "I won't need this RDD anymore." Even for you it can be very tricky, because of situations like the following:

JavaRDD<T> rddUnion = sc.parallelize(new ArrayList<T>()); // create empty for merging
for (int i = 0; i < 10; i++)
{
    JavaRDD<T2> rdd = sc.textFile(inputFileNames[i]);
    rdd.cache(); // Since it will be used twice, cache.
    rdd.map(...).filter(...).saveAsTextFile(outputFileNames[i]); // Transform and save, rdd materializes
    rddUnion = rddUnion.union(rdd.map(...).filter(...)); // Do another transform to T and merge by union
    rdd.unpersist(); // Now it seems not needed. (But is needed actually)
}

// Here, rddUnion actually materializes, and needs all 10 rdds that were already unpersisted. So, rebuilding all 10 rdds will occur.
rddUnion.saveAsTextFile(mergedFileName);

Credits for the code sample go to the spark-user mailing list.
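To make the pitfall concrete, here is a minimal standalone sketch (not from the original answer) of one way to avoid the rebuild: keep a handle to each cached RDD and unpersist only after rddUnion has materialized. The file names and the trivial map/filter lambdas are hypothetical placeholders for the elided logic in the example above.

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class UnpersistAfterMaterialize {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "unpersist-sketch");

        List<JavaRDD<String>> cached = new ArrayList<>();
        JavaRDD<String> rddUnion = sc.parallelize(new ArrayList<String>());

        for (int i = 0; i < 10; i++) {
            JavaRDD<String> rdd = sc.textFile("input-" + i + ".txt"); // hypothetical paths
            rdd.cache(); // still used twice, so cache

            // First use: transform and save a per-file output.
            rdd.map(String::trim).filter(s -> !s.isEmpty())
               .saveAsTextFile("output-" + i);

            // Second use: another transform, merged via union.
            rddUnion = rddUnion.union(rdd.map(String::toUpperCase).filter(s -> s.length() > 3));

            cached.add(rdd); // keep the handle; do NOT unpersist yet
        }

        // rddUnion materializes here and reads the 10 inputs from the cache.
        rddUnion.saveAsTextFile("merged-output");

        // Only now is it safe to drop the cached blocks.
        for (JavaRDD<String> rdd : cached) {
            rdd.unpersist();
        }

        sc.stop();
    }
}

The only change relative to the answer's example is where unpersist() happens: once the union has been written out, nothing downstream depends on the cached blocks anymore, so dropping them triggers no rebuild.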


Hi @C4stor, thanks for your answer, but looking at https://github.com/apache/spark/pull/126 and ContextCleaner.scala, it seems Spark does do some automatic cleanup of RDDs. So I'm not sure how and when Spark decides it is safe to unpersist an RDD. –
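As a hedged illustration of the automatic path the comment refers to: the ContextCleaner unpersists a cached RDD only after its driver-side reference has been garbage-collected, so the timing is non-deterministic, and an explicit unpersist() remains the predictable option. The snippet below is an assumption-laden sketch with a hypothetical input path, not code from the thread.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CleanerSketch {
    public static void main(String[] args) throws InterruptedException {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "cleaner-sketch");

        JavaRDD<String> lines = sc.textFile("input.txt"); // hypothetical path
        lines.cache();
        System.out.println(lines.count()); // materializes and caches the blocks

        // Explicit alternative (deterministic): lines.unpersist();

        // GC-driven path: dropping the last driver-side reference makes the
        // cached RDD a candidate for automatic cleanup; the ContextCleaner
        // unpersists it after the object is garbage-collected, at some later,
        // unspecified time.
        lines = null;
        System.gc();        // may trigger it sooner, but there is no guarantee
        Thread.sleep(1000); // purely to give the asynchronous cleaner a chance

        sc.stop();
    }
}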