2012-11-23 49 views
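Subsetting an ffdf object (subset vs ffwhich)

I am subsetting a large ffdf object, and I noticed that subset.ff generates a lot of NAs. I tried an alternative approach using ffwhich, which is faster and does not generate any NAs. Here is my test: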

library(ffbase) 
# deals is the ffdf I would like to subset 
unique(deals$COMMODITY) 
ff (open) integer length=7 (7) levels: CASH CO2 COAL ELEC GAS GCERT OIL 
    [1] [2] [3] [4] [5] [6] [7] 
CASH CO2 COAL ELEC GAS GCERT OIL 

# Using subset.ff 
started.at=proc.time() 
deals0 <- subset.ff(deals,deals$COMMODITY %in% c("CASH","COAL","CO2","ELEC","GCERT")) 
cat("Finished in",timetaken(started.at),"\n") 
Finished in 12.640sec 
# NAs are generated 
unique(deals0$COMMODITY) 
ff (open) integer length=8 (8) levels: CASH CO2 COAL ELEC GAS GCERT OIL <NA> 
    [1] [2] [3] [4] [5] [6] [7] [8] 
CASH CO2 COAL ELEC GAS GCERT OIL NA  

# Subset using ffwhich 
started.at=proc.time() 
idx <- ffwhich(deals,COMMODITY %in% c("CASH","COAL","CO2","ELEC","GCERT")) 
deals1 <- deals[idx,] 
cat("Finished in",timetaken(started.at),"\n") 
Finished in 3.130sec 
# No NAs are generated 
unique(deals1$COMMODITY) 
ff (open) integer length=7 (7) levels: CASH CO2 COAL ELEC GAS GCERT OIL 
    [1] [2] [3] [4] [5] [6] [7] 
CASH CO2 COAL ELEC GAS GCERT OIL 

Any idea why this happens?

Answers

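subset.ff is probably using "[" with your criteria but without an !is.na(.) clause. By default, "[" returns the items for which the condition vector is TRUE or NA. The regular subset function adds an !is.na(.) clause, but perhaps the authors of ffbase had not accounted for that.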
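The behaviour described above can be reproduced in base R without ff at all. The following is a minimal sketch on an ordinary data frame; the `deals_df` data and its values are made up for illustration, and it says nothing about how ffbase is implemented internally. Note that base R's `%in%` never returns NA, so the sketch uses `==` to obtain a condition vector that contains NA.

```r
# Hypothetical stand-in for the ffdf: a small in-memory data frame.
deals_df <- data.frame(
  COMMODITY = c("CASH", "GAS", NA, "COAL", "OIL"),
  VOLUME    = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

# Comparing NA with == yields NA, so the condition is TRUE/FALSE/NA.
cond <- deals_df$COMMODITY == "CASH" | deals_df$COMMODITY == "COAL"

# "[" keeps TRUE rows and also emits an all-NA row for each NA in cond.
by_bracket <- deals_df[cond, ]

# subset() treats NA in the condition as FALSE, so no NA rows appear.
by_subset <- subset(deals_df, cond)

# which() returns only the positions that are TRUE, dropping NA.
by_which <- deals_df[which(cond), ]

print(nrow(by_bracket))
print(nrow(by_subset))
print(nrow(by_which))
```

Indexing by positions, as `which()` does here, is the same idea as the ffwhich approach in the question: only the positions where the condition holds are selected, so an NA in the condition cannot turn into an NA row.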

Correct! So the best option in this case is to use ffwhich. – jwijffels

OK, thanks for the clarification. – ddg

This issue has been addressed in ffbase version 0.6.2 on CRAN. See the NEWS file: http://cran.r-project.org/web/packages/ffbase/NEWS – jwijffels