SPARK - How to force an error with sc.parallelize

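The statement below always gives the correct result, no matter how many partitions are passed to parallelize. Why does it always give the correct result?

Reading a big file or using mapPartitions would result in a slight loss of accuracy, so why not here? It must be something simple, but I cannot see it.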

import org.apache.spark.mllib.rdd.RDDFunctions._  // brings sliding into scope for RDDs

val rdd = sc.parallelize(Array("A", "B", "C", "D", "E", "F"), 5)
rdd.sliding(2).collect()


2016-11-08 thebluephantom

"Reading a big file or using mapPartitions would result in a slight loss of accuracy,"

It won't. The result does not depend on the source at all.


2016-11-08 11:43:00

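But for what reason? With a big file, I would understand that windows get lost at the data boundaries. The same would be true if, for the sake of argument, I wrote a mapPartitions myself. I wrote such a def, and the loss is easy to demonstrate. So what is the reason that the result is always fine here? For sums and multiplication it is easy to follow. – thebluephantom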

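I understand that there is no loss, but I am looking for the reason why! I simulated a mapPartitions approach. Surely the whole idea is that there is more than one partition being mapped? – thebluephantom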

I did not understand the last point. – thebluephantom

From Hortonworks:

sliding() keeps track of the partition index, which in this case corresponds to the ordering of the unigrams.

Compare rdd.mapPartitionsWithIndex { (i, p) => p.map { e => (i, e) } }.collect() and rdd.sliding(2).mapPartitionsWithIndex { (i, p) => p.map { e => (i, e) } }.collect() to help with the intuition.
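To make the intuition concrete without a cluster, here is a minimal plain-Scala sketch of the two behaviours discussed in this thread. It is not Spark's implementation. It assumes that a partition-aware sliding lets each partition see the first windowSize - 1 elements of the partitions that follow it, and that parallelize splits a local collection into contiguous chunks. The names SlidingSketch, split, naive and boundaryAware are hypothetical and exist only for this illustration.

```scala
object SlidingSketch {
  // Assumption: cut the data into n contiguous chunks, which approximates
  // how a local collection might be sliced into partitions.
  def split[T](data: Seq[T], n: Int): Seq[Seq[T]] =
    (0 until n).map { i =>
      val start = (i.toLong * data.length / n).toInt
      val end = ((i + 1).toLong * data.length / n).toInt
      data.slice(start, end)
    }

  // Naive approach (what a hand-written mapPartitions would do):
  // window each partition in isolation, so windows that span a
  // partition boundary are lost.
  def naive[T](parts: Seq[Seq[T]], w: Int): Seq[Seq[T]] =
    parts.flatMap(p => p.sliding(w).filter(_.size == w))

  // Boundary-aware approach: extend each partition with the first w - 1
  // elements found in the following partitions, then window it.
  // Partial windows at the very end of the data are dropped.
  def boundaryAware[T](parts: Seq[Seq[T]], w: Int): Seq[Seq[T]] =
    parts.indices.flatMap { i =>
      val tail = parts.drop(i + 1).flatten.take(w - 1)
      (parts(i) ++ tail).sliding(w).filter(_.size == w)
    }

  def main(args: Array[String]): Unit = {
    val data = Seq("A", "B", "C", "D", "E", "F")
    val parts = split(data, 5)
    println(naive(parts, 2).map(_.mkString))         // windows found per partition only
    println(boundaryAware(parts, 2).map(_.mkString)) // windows including boundary crossings
  }
}
```

With the six letters split into five chunks, the naive version keeps only the window that fits inside a single chunk, while the boundary-aware version yields the same windows as data.sliding(2) on the unsplit sequence, whatever the number of chunks.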


2016-11-08 18:50:09 thebluephantom
