Filter on an RDD in Apache Spark removes more rows than expected
I have an RDD that looks like this:
1.0,2.0,0.0019,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,3.0,0.0,3.0E-4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,5.0,-0.0019,-2.0E-4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,8.4294
1.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,7.0,0.0,1.0E-4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,8.0,0.0,3.0E-4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,9040.8,0.0,0.0,0.0,0.0,0.0,0.0
1.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
1.0,10.0,-0.0033,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,47.03,0.0,0.0,0.0,0.0
1.0,11.0,0.0,-3.0E-4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,554.54,0.0,0.0,0.0,0.0,0.0,0.0,8.140.58,0.0
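I need to filter out the rows whose number of zeros equals a specific number, say 15. The filter method defined below removes more rows than expected.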
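// Returns false (row is dropped by filter) when the row has exactly 15 zero values.
def filterZeroRowsWReadings(row: Array[String]): Boolean = {
  var flag: Int = 0
  for (value <- row) {
    // Numeric comparison: "0", "-0.0" and "0.00" all count as zero here.
    if (value.toDouble == 0.0)
      flag = flag + 1
  }
  flag match {
    case 15 => false
    case _ => true
  }
}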
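However, when I manually counted the rows with that number of zeros in a subset of my RDD, I got 3834, while the filter method above removes 3960 rows. I can't work out where the extra 126 rows come from. Is there a way to find out what is happening? On smaller RDDs the result is as expected, but on the large RDD it is not.
Thanks.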
Maybe this is a precision problem? You could try comparing each value as the string "0.0" and see whether that changes anything.
Spot on, I did that and it works as expected. But this shouldn't happen: 0.00003 != 0.0 – atalpha
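It depends on your machine. 0.00003 should not be a problem, but 3E-60 might be. You may want to use collect() to print the rows where the results differ and compare them with your manual method. It is possible that your manual method is the one that is broken. Please mark the answer below as "correct" for future reference.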
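One way to follow the suggestion above and list the rows on which the two counting rules disagree is sketched below. It is only an illustration under stated assumptions: the object name ZeroCountDiff, the helper names, and the sample rows are all made up for this example and do not come from the original data. It uses a plain Scala Seq so that it runs without a Spark cluster.

```scala
object ZeroCountDiff {
  // Numeric rule, as in the question: any text that parses to zero counts,
  // so "0", "-0.0" and "0.00" are all counted.
  def numericZeros(row: Array[String]): Int =
    row.count(_.toDouble == 0.0)

  // Textual rule, as in the comment: only the exact string "0.0" counts.
  def literalZeros(row: Array[String]): Int =
    row.count(_ == "0.0")

  // Rows on which the two rules give different zero counts.
  def disagreements(rows: Seq[Array[String]]): Seq[Array[String]] =
    rows.filter(r => numericZeros(r) != literalZeros(r))

  def main(args: Array[String]): Unit = {
    // Hypothetical sample rows, invented for illustration only.
    val rows = Seq(
      "1.0,2.0,0.0,0.0,3.0E-4",
      "1.0,3.0,0,-0.0,0.00",
      "1.0,4.0,0.00003,0.0,0.0"
    ).map(_.split(","))

    // Print each disagreeing row together with both counts.
    disagreements(rows).foreach { r =>
      println(s"${r.mkString(",")} numeric=${numericZeros(r)} literal=${literalZeros(r)}")
    }
  }
}
```

Assuming the data is held in an RDD[Array[String]] named rdd (a hypothetical name), the same predicate can be passed to rdd.filter(...) and the result brought to the driver with collect(), or with take(n) if the number of differing rows could be large. Inspecting those rows should show which cell values parse to zero without being the literal text "0.0".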