2017-06-13 42 views
1

Filter on an RDD in Apache Spark removes more rows than expected

I have an RDD that looks roughly like this:

1.0,2.0,0.0019,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,3.0,0.0,3.0E-4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,5.0,-0.0019,-2.0E-4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,8.4294
1.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,7.0,0.0,1.0E-4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,8.0,0.0,3.0E-4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,9040.8,0.0,0.0,0.0,0.0,0.0,0.0
1.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
1.0,10.0,-0.0033,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,47.03,0.0,0.0,0.0,0.0
1.0,11.0,0.0,-3.0E-4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,554.54,0.0,0.0,0.0,0.0,0.0,0.0,8.140.58,0.0

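I need to filter out the rows whose number of zeros equals a specific number, say 15. The filter method defined below removes more rows than expected.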

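def filterZeroRowsWReadings(row: Array[String]): Boolean = {
    // Count the fields that parse to exactly 0.0.
    var flag: Int = 0
    for (value <- row) {
        if (value.toDouble == 0.0)
            flag = flag + 1
    }
    // Drop the row (false) when it has exactly 15 zeros; keep it otherwise.
    flag match {
        case 15 => false
        case _ => true
    }
}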

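However, I manually counted the rows with that number of zeros in a subset of my RDD and got 3834, while the filter method above removes 3960 rows. I can't figure out where the extra 126 rows come from. Is there a way to find out what is happening? On smaller RDDs the result is as expected, but on the large RDD it is not.

Thanks.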

+1

Maybe this is a precision issue? You could try comparing each value as the string "0.0" and see whether that changes anything. –

+0

Spot on, I did that and it works as expected. But this shouldn't happen: 0.00003 != 0.0 – atalpha
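
The thread never pins down why the numeric and the string comparison disagree. One possibility, offered here only as a hedged sketch, is that some fields are spelled differently from "0.0" yet still parse to zero. The spellings below are hypothetical and are not taken from the asker's data; the object and method names are likewise made up for illustration.

```scala
object ZeroSpellings {
  // Numeric test, as in the question's filter: parse, then compare with 0.0.
  def isNumericZero(s: String): Boolean = s.toDouble == 0.0

  // Literal test, as suggested in the comment: exact match on the text "0.0".
  def isLiteralZero(s: String): Boolean = s == "0.0"

  // Hypothetical field spellings (assumption: not from the asker's RDD).
  val samples: Seq[String] =
    Seq("0.0", "0", "-0.0", "0.00", " 0.0", "0E0", "1e-400", "0.00003", "3E-60")

  def main(args: Array[String]): Unit =
    for (s <- samples)
      println(s"'$s' numeric=${isNumericZero(s)} literal=${isLiteralZero(s)}")
}
```

Every sample except "0.00003" and "3E-60" counts as a numeric zero, while only "0.0" counts as a literal zero. If the data contains any such spellings, the two counting methods would give different totals.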

+0

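Depends on your machine. 0.00003 shouldn't be a problem, but 3E-60 might be. You may want to use collect() to print out the rows that differ and compare them with your manual method. It is possible that your manual method is the broken one. Please mark the answer below as "correct" for future reference. –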
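
The suggestion to print the rows that differ could be sketched roughly as follows. This is a minimal illustration on a plain Scala `Seq` rather than a real RDD, and it assumes the manual count matched the literal text "0.0", which the thread does not confirm. `ZeroCountDiff` and its methods are hypothetical names, and the sample rows are invented.

```scala
object ZeroCountDiff {
  // Zeros counted numerically, as in filterZeroRowsWReadings.
  def numericZeros(row: Array[String]): Int = row.count(_.toDouble == 0.0)

  // Zeros counted textually (assumed to mirror the manual count).
  def literalZeros(row: Array[String]): Int = row.count(_ == "0.0")

  // Rows on which the two counting methods disagree.
  def disagreeing(rows: Seq[Array[String]]): Seq[Array[String]] =
    rows.filter(r => numericZeros(r) != literalZeros(r))

  def main(args: Array[String]): Unit = {
    // Invented sample rows, much shorter than the real 20-column rows.
    val rows = Seq(
      "1.0,0.0,0.0".split(","),
      "1.0,0,0.0".split(","),
      "1.0,-0.0,3.0E-4".split(",")
    )
    disagreeing(rows).foreach(r => println(r.mkString(",")))
  }
}
```

On an `RDD[Array[String]]` the same predicate could be passed to `filter`, followed by `take(n)` or `collect()` to bring a sample of the disagreeing rows back to the driver for inspection.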

Answers

1

Maybe this is a precision issue? You could try comparing each value as the string "0.0" and see whether that changes anything.
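
A string-based version of the filter might look like the sketch below. It is one possible reading of the suggestion, not the asker's final code; `LiteralZeroFilter`, `keepRow`, and the `target` parameter are hypothetical names introduced for illustration.

```scala
object LiteralZeroFilter {
  // Keeps a row unless exactly `target` of its fields are the literal text "0.0".
  // No parsing happens, so spellings such as "0" or "-0.0" are not counted as zeros.
  def keepRow(row: Array[String], target: Int = 15): Boolean =
    row.count(_ == "0.0") != target

  def main(args: Array[String]): Unit = {
    // Invented short rows; target = 2 keeps the example small.
    val rows = Seq("1.0,0.0,0.0".split(","), "1.0,0,0.0".split(","))
    rows.filter(keepRow(_, target = 2)).foreach(r => println(r.mkString(",")))
  }
}
```

The trade-off is that a literal comparison is sensitive to formatting: it only matches fields written exactly as "0.0", so it agrees with a manual text-based count but may miss values that are numerically zero.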
