2016-09-24 188 views

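R, bit64: problems calculating row mean and standard deviation in a data.table

I am trying to work with numbers larger than 2^32. Although I am also using data.table and fread, I don't think the problem is related to them: I can switch the symptoms on and off without changing data.table or using fread. The symptom is that I get reported means such as 4.1e-302 when I expect proper exponents between 1e+3 and 1e+17.

The problem appears consistently when I use the bit64 package and the functions related to integer64. Things work for me with "regular-sized data and R", but I am evidently not expressing things correctly with this package. See my code and data below.

I am on a MacBook Pro, 16GB, i7 (updated).

I have restarted my R session and cleared the workspace, but the problem persists.

Please advise; I appreciate the input. I assume it has to do with using the library, bit64.

Links I looked at include the bit64 doc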

An issue that had similar symptoms caused by an fread() memory leak, but I think I eliminated

Here is my input data

var1,var2,var3,var4,var5,var6,expected_row_mean,expected_row_stddev 
1000 ,993 ,987 ,1005 ,986 ,1003 ,996 ,8 
100000 ,101040 ,97901 ,100318 ,96914 ,97451 ,98937 ,1722 
10000000 ,9972997 ,9602778 ,9160554 ,8843583 ,8688500 ,9378069 ,565637 
1000000000 ,1013849241 ,973896894 ,990440721 ,1030267777 ,1032689982 ,1006857436 ,23096234 
100000000000 ,103171209097 ,103660949260 ,102360301140 ,103662297222 ,106399064194 ,103208970152 ,2078732545 
10000000000000 ,9557954451905 ,9241065464713 ,9357562691674 ,9376495364909 ,9014072235909 ,9424525034852 ,334034298683 
1000000000000000 ,985333546044881 ,994067361457872 ,1034392968759970 ,1057553099903410 ,1018695335152490 ,1015007051886440 ,27363415718203 
100000000000000000 ,98733768902499600 ,103316759127969000 ,108062824583319000 ,111332326225036000 ,108671041505404000 ,105019453390705000 ,5100048567944390 

My code, working with this sample data

# file: problem_bit64.R 
# OBJECTIVE: Using larger numbers, I want to calculate a row mean and row standard deviation 
# ERROR: I don't know what I am doing wrong to get such errors, seems bit64 related 
# PRIORITY: BLOCKED (do this in Python instead?) 
# reported Sat 9/24/2016 by Greg 

# sample data: 
# each row is 100 times larger on average, for 8 rows, starting with 1,000 
# for the vars within a row, there is 10% uniform random variation. B2 = ROUND(A2+A2*0.1*(RAND()-0.5),0)  

# Install development version of data.table --> for fwrite() 
install.packages("data.table", repos = "https://Rdatatable.github.io/data.table", type = "source") 
require(data.table) 
require(bit64) 
.Machine$integer.max # 2147483647  Is this an issue ? 
.Machine$double.xmax # 1.797693e+308 I assume not 

# ------------------------------------------------------------------- 
# ---- read in and basic info that works 
csv_in <- "problem_bit64.csv" 
dt <- fread(csv_in) 
dim(dt)    # 8 8
lapply(dt, class)  # "integer64" for all 8 
names(dt) # "var1" "var2" "var3" "var4" "var5" "var6" "expected_row_mean" "expected_row_stddev" 
dtin <- dt[, 1:6, with=FALSE] # just save the 6 input columns 

... now the problems in R

# ------------------------------------------------------------------- 
# ---- CALCULATION PROBLEMS START HERE 
# ---- for each row, I want to calculate the mean and standard deviation 
a <- apply(dtin, 1, mean.integer64); a # get 8 values like 4.9e-321 
b <- apply(dtin, 2, mean.integer64); b # get 6 values like 8.0e-308 

# ---- try secondary variations that do not work 
c <- apply(dtin, 1, mean); c    # get 8 values like 4.9e-321 
c <- apply(dtin, 1, mean.integer64); c # same result 
c <- apply(dtin, 1, function(x) mean(x)); c   # same 
c <- apply(dtin, 1, function(x) sum(x)/length(x)); c # same results as mean(x) 

##### I don't see any sd.integer64  # FEATURE REQUEST, Z-TRANSFORM IS COMMON 
c <- apply(dtin, 1, function(x) sd(x)); c   # unrealistic values - see expected 

Regular-sized numbers in R on ordinary data, still read in with fread() into a data.table() - WORKS

# ------------------------------------------------------------------- 
# ---- delete big numbers, and try regular stuff - WHICH WORKS 
dtin2 <- dtin[ 1:3, ] # just up to about 10 million (SAME DATA, SAME FREAD, SAME DATA.TABLE) 
dtin2[ , var1 := as.integer(var1) ] # I know there are fancier ways to do this 
dtin2[ , var2 := as.integer(var2) ] # but I want things to work before getting fancy. 
dtin2[ , var3 := as.integer(var3) ] 
dtin2[ , var4 := as.integer(var4) ] 
dtin2[ , var5 := as.integer(var5) ] 
dtin2[ , var6 := as.integer(var6) ] 
lapply(dtin2, class) # validation 

c <- apply(dtin2, 1, mean); c # get 3 row values AS EXPECTED (matching expected columns) 
c <- apply(dtin2, 1, function(x) mean(x)); c   # CORRECT 
c <- apply(dtin2, 1, function(x) sum(x)/length(x)); c # same results as mean(x) 

c <- apply(dtin2, 1, sd); c    # get 3 row values AS EXPECTED (matching expected columns) 
c <- apply(dtin2, 1, function(x) sd(x)); c   # CORRECT 

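Have you tried other alternatives for big numbers, such as 'Brobdingnag'? They may not play well with data.table, but you are not really using any data.table-specific features. You could even use fread with 'data.table = FALSE' to get a data frame. – dracodoc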

Answer


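As a short, first recommendation for most readers: unless you have a specific reason to use 64-bit integers, use 'double' instead of 'integer64'. 'double' is an R internal data type, whereas 'integer64' is a package extension data type. It is represented as a 'double' vector with the class attribute 'integer64', i.e. the 64 bits of each element are interpreted as a 64-bit integer by code that knows about this class. Unfortunately, many core R functions do not know about 'integer64', and that easily leads to wrong results. Hence coercing to 'double'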

dtind <- dtin 
for (i in seq_along(dtind)) 
    dtind[[i]] <- as.double(dtind[[i]]) 
b <- apply(dtind, 1, mean) 

gives roughly the expected results

> b 
[1] 9.956667e+02 9.893733e+04 9.378069e+06 1.006857e+09 1.032090e+11 9.424525e+12 1.015007e+15 1.050195e+17 

although not exactly what you expected, neither when looking at the rounded differences

> b - dt$expected_row_mean 
integer64 
[1] -1 0 -1 -1 0 -1 -3 -392 

nor when looking at the unrounded differences

> b - as.double(dt$expected_row_mean) 
[1] -0.3333333 0.3333333 -0.3333333 -0.1666666 0.1666718 -0.3339844 -2.8750000 -384.0000000 
Warning message:
In as.double.integer64(dt$expected_row_mean) : 
    integer precision lost while converting to double 
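
The precision warning in the output above reflects a hard limit of the 'double' type: with a 53-bit significand, it represents every integer exactly only up to 2^53 (about 9.0e15). The following base-R sketch illustrates that limit; the variable names are illustrative and do not come from the question.

```r
# Doubles represent every integer exactly only up to 2^53.
limit <- 2^53                                 # 9007199254740992

limit_absorbs_one <- (limit + 1) == limit     # the +1 is rounded away
below_is_exact    <- (limit - 1) != limit     # just below the limit, steps of 1 still exist

# The last row of the sample data holds values around 1e17, above the limit.
big <- 1e17
big_absorbs_one <- (big + 1) == big           # adding 1 no longer changes the value

print(c(limit_absorbs_one, below_is_exact, big > limit, big_absorbs_one))
```

All four checks should print TRUE, which is why values of the size found in the last row cannot all survive a round trip through 'double'.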

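OK, let's assume you really do want integer64, because your largest numbers exceed the range in which a double represents integers exactly (2^53). Then your problem starts with the fact that 'apply' does not know about integer64 and actually destroys the 'integer64' class attribute: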

> apply(dtin, 1, is.integer64) 
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
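
One plausible reading of the tiny values reported in the question (4.9e-321 and similar) is this: once the class attribute is gone, R treats the 64 bits of each element as an ordinary double, and a small integer stored in those bits corresponds to a subnormal double. The base-R sketch below imitates this without bit64 by writing the integer 1000 as eight little-endian bytes and reading them back as a double. It only illustrates the bit pattern and is not bit64's own code.

```r
# Write 1000 as a 64-bit little-endian integer:
# low 32-bit word = 1000, high 32-bit word = 0.
bytes <- writeBin(c(1000L, 0L), raw(), size = 4, endian = "little")

# Read the same eight bytes back as a single double.
reinterpreted <- readBin(bytes, what = "double", n = 1, size = 8,
                         endian = "little")

# A subnormal number on the order of 1e-321, the same magnitude as the
# means reported in the question for the first row.
print(reinterpreted)
```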

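'apply' actually destroys the 'integer64' class attribute twice: once when preparing the input and once when post-processing the output. We can fix this with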

c <- apply(dtin, 1, function(x){ 
    oldClass(x) <- "integer64" # fix 
    mean(x) # note that this dispatches to mean.integer64 
}) 
oldClass(c) <- "integer64" # fix again 

Now the result looks reasonable

> c 
integer64 
[1] 995    98937    9378068   1006857435   103208970152  9424525034851  1015007051886437 105019453390704600 

but it is still not what you expected

> c - dt$expected_row_mean 
integer64 
[1] -1 0 -1 -1 0 -1 -3 -400 

The small differences (-1) are due to rounding, because the floating-point mean is

> b[1] 
[1] 995.6667 

while you expect

> dt$expected_row_mean[1] 
integer64 
[1] 996 

whereas mean.integer64 coerces (truncates) to integer64. This behavior of mean.integer64 is debatable, but at least it is consistent:

x <- seq(0, 1, 0.25) 
> data.frame(x=x, y=as.integer64(0) + x) 
    x y 
1 0.00 0 
2 0.25 0 
3 0.50 0 
4 0.75 0 
5 1.00 1 
> mean(as.integer64(0:1)) 
integer64 
[1] 0 

The rounding issue makes it clear that implementing sd.integer64 would be even more debatable. Should it return integer64 or double?
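
If double precision is acceptable, a row standard deviation can be computed after coercing to 'double', in the same way as the row mean earlier in this answer. The sketch below uses only base R and types in the first three rows of the sample data by hand; the names `m`, `row_mean` and `row_sd` are illustrative. It assumes all values stay well below 2^53, which does not hold for the last row of the sample data.

```r
# First three rows of the sample data (var1..var6), stored as doubles.
m <- matrix(c(1000,     993,     987,     1005,    986,     1003,
              100000,   101040,  97901,   100318,  96914,   97451,
              10000000, 9972997, 9602778, 9160554, 8843583, 8688500),
            nrow = 3, byrow = TRUE)

row_mean <- rowMeans(m)      # row-wise mean, returned as double
row_sd   <- apply(m, 1, sd)  # row-wise sample standard deviation

print(data.frame(row_mean, row_sd))
```

Both results are doubles, so no truncation to whole numbers takes place; round them explicitly if integer output is wanted.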

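Regarding the larger differences, it is unclear what the rationale behind your expectation is: taking the seventh row of your table and subtracting its minimum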

x <- (unlist(dtin[7,])) 
oldClass(x) <- "integer64" 
y <- min(x) 
z <- as.double(x - y) 

gives numbers in the range where 'double' handles integers exactly

> log2(z) 
[1] 43.73759  -Inf 42.98975 45.47960 46.03745 44.92326 

Averaging those and comparing the result against your expectation still gives a difference that is not explained by rounding.

Thank you very much for the reply, this is great.
