Operations using ffdfwith in R

I am using ff with R because I have a huge dataset (about 16 GB) to work with. As a test case, I took a 1M-record file and wrote it out as an ff database.

system.time(te3 <- read.csv.ffdf(file="testdata.csv", sep = ",", header=TRUE, first.rows=10000, next.rows=50000, colClasses=c("numeric","numeric","numeric","numeric")))

I have uploaded the resulting files (te3) here: http://bit.ly/1c8pXqt

I am trying to do a simple calculation to create a new variable:

ffdfwith(te3, {odfips <- ofips*100000 + dfips})

I get the following error (there are no missing records), and it has me stumped:

Error in if (by < 1) stop("'by' must be > 0") : missing value where TRUE/FALSE needed 
In addition: Warning message: In chunk.default(from = 1L, to = 1000000L, by = 2293760000, maxindex = 1000000L) : NAs introduced by coercion

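Any insights would be appreciated. Also, related to ff: is it possible to use standard R packages, such as MCMC, with an ff database (I need to use the inverse gamma function)?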

TIA,

Krishnan

Source

2014-02-26 Krishnan

Adding an extra variable to an ffdf is a basic task, and there are several ways to do it. See below. I downloaded your zip file from http://bit.ly/1c8pXqt and unzipped it.

require(ffbase) 
load.ffdf(dir="/home/janw/Desktop/stackoverflow/ffdb") 

## Using ffdfwith or with will chunkwise execute the expression 
te3$odfips <- ffdfwith(te3, ofips*100000 + dfips) 
te3$odfips <- with(te3, ofips*100000 + dfips) 
## It is better to restrict to the columns you need in the expression,
## otherwise the other columns are also loaded into RAM unnecessarily.
## This will speed things up.
te3$odfips <- ffdfwith(te3[c("ofips","dfips")], ofips*100000 + dfips) 
te3$odfips <- with(te3[c("ofips","dfips")], ofips*100000 + dfips) 
## ffdfwith looks at options("ffbatchbytes") and works out how many rows of your ffdf
## fit in one batch without exceeding options("ffbatchbytes"), and hence RAM.
## So this variable is created in chunks.
## If you want to specify the chunk size yourself, you can e.g. pass the by argument
## to with, which is passed on to ?chunk. E.g. below, this variable is created
## in chunks of 100000 records.
te3$odfips <- with(te3[c("ofips","dfips")], ofips*100000 + dfips, by = 100000) 

## As the Ops * and + are implemented in ffbase for ff vectors you can also do this: 
te3$odfips <- te3$ofips * 100000 + te3$dfips
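
To make the chunked evaluation described in the comments above more concrete, here is a minimal base-R sketch. It is not how ff or ffbase implement it. Plain in-memory vectors stand in for the ff columns, the sample values are made up for illustration, and the chunk size of 2 is a hypothetical stand-in for by = 100000.

```r
# Plain in-memory vectors stand in for the ff columns (illustrative data only).
ofips <- c(1001, 1003, 6037, 36061, 48201)
dfips <- c(1003, 6037, 1001, 48201, 36061)

n  <- length(ofips)
by <- 2L                 # hypothetical chunk size; the answer above uses by = 100000
odfips <- numeric(n)

# Walk over the rows in blocks of `by` and evaluate the expression per block,
# so that only one block of each input is handled at a time.
for (from in seq(1L, n, by = by)) {
  idx <- from:min(from + by - 1L, n)
  odfips[idx] <- ofips[idx] * 100000 + dfips[idx]
}

# The chunked result matches the one-shot vectorised computation.
stopifnot(identical(odfips, ofips * 100000 + dfips))
```

With real ff data, the point of working in blocks is that only the rows of the current block need to be in RAM, which is what the batch size controls.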

Why you are getting this error is not clear to me. Maybe you have set options("ffbatchbytes") to a very low number? I do not get this error.
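
One observation about the message itself, offered as a possible reading and not a confirmed diagnosis: the by value reported in the warning, 2293760000, is larger than the biggest value R can store as an integer. The base-R sketch below only shows what happens when such a value is coerced to integer and then tested in an if(); whether ff performs exactly this coercion internally is an assumption.

```r
by <- 2293760000                            # the `by` value shown in the warning

# .Machine$integer.max is 2147483647, so this value does not fit in an integer.
exceeds_int <- by > .Machine$integer.max

# Coercing it to integer yields NA (R also emits a coercion warning, suppressed here).
by_int <- suppressWarnings(as.integer(by))

# Using an NA as an if() condition raises an error instead of returning TRUE/FALSE.
msg <- tryCatch(
  {
    if (by_int < 1) stop("'by' must be > 0")
    "no error"
  },
  error = function(e) conditionMessage(e)
)

print(exceeds_int)
print(by_int)
print(msg)
```

If this reading is right, a batch size that is too large, not too small, would be the thing to look for, for example by inspecting getOption("ffbatchbytes") in the session where the error occurs.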

The MCMC question is too vague to answer.

Source

2014-02-27 09:41:32 jwijffels

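Thanks for your insights and detailed comments. I had set my memory.limit very high, but did not know about ffbatchbytes. I will test with ffbatchbytes and see if the error still occurs. Regarding MCMC, my question was a more general one: can standard R packages be used with ff? From what I have read they should, but I am not sure. – Krishnan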

Er... rebooting the machine and running the code again fixed it. – Krishnan

Regarding standard R packages: it depends. Some work as they are, others need minor or more substantial changes. – jwijffels
