2014-02-07
7

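fread() fails with missing values in integer64 columns

When reading the text below, fread() fails to detect the missing values in columns 8 and 9. This happens only with the default option integer64="integer64". Setting integer64="double" or "character" correctly detects the NAs. Note that the file has three possible kinds of NA in V8 and V9: `,,`; `, ,`; and `NA`. Adding na.strings=c("NA","N/A",""," "), sep="," as options has no effect.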

Using read.csv() behaves the same way as fread(integer64="double").

The text to be read (also available as a file, integer64_and_NA.csv):

2012,276,,0,"S1","001",1,,724135215,1590915056, 
2012,276,2,8,"S1","001",1, ,,154598,0 
2012,276,2,12,"S1","001",1,NA,5118863,21819477, 
2012,276,2,0,"S1","011",8,3127133583,3127133583,9003982501,0 

Here is the output from fread():

DT <- fread(input="integer64_and_NA.csv", verbose=TRUE, integer64="integer64", na.strings=c("NA","N/A",""," "), sep=",") 

Input contains no \n. Taking this to be a filename to open 
Detected eol as \r\n (CRLF) in that order, the Windows standard. 
Looking for supplied sep ',' on line 4 (the last non blank line in the first 'autostart') ... found ok 
Found 11 columns 
First row with 11 fields occurs on line 1 (either column names or first row of data) 
Some fields on line 1 are not type character (or are empty). Treating as a data row and using default column names. 
Count of eol after first data row: 5 
Subtracted 1 for last eol and any trailing empty lines, leaving 4 data rows 
Type codes: 11114412221 (first 5 rows) 
Type codes: 11114412221 (after applying colClasses and integer64) 
Type codes: 11114412221 (after applying drop or select (if supplied) 
Allocating 11 column slots (11 - 0 NULL) 
    0.000s ( 0%) Memory map (rerun may be quicker) 
    0.000s ( 0%) sep and header detection 
    0.000s ( 0%) Count rows (wc -l) 
    0.000s ( 0%) Column type detection (first, middle and last 5 rows) 
    0.000s ( 0%) Allocation of 4x11 result (xMB) in RAM 
    0.000s ( 0%) Reading data 
    0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered 
    0.000s ( 0%) Coercing data already read in type bumps (if any) 
    0.000s ( 0%) Changing na.strings to NA 
    0.001s  Total 

The resulting data.table is:

DT 
    V1 V2 V3 V4 V5 V6 V7     V8     V9  V10 V11 
1: 2012 276 NA 0 S1 001 1 9218868437227407266   724135215 1590915056 NA 
2: 2012 276 2 8 S1 001 1 9218868437227407266 9218868437227407266  154598 0 
3: 2012 276 2 12 S1 001 1 9218868437227407266    5118863 21819477 NA 
4: 2012 276 2 0 S1 011 8   3127133583   3127133583 9003982501 0 

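NA values are detected correctly in the columns that are not integer64. For V8 and V9, which fread() marks as integer64, we get "9218868437227407266" instead of NAs. Interestingly, str() shows the corresponding values of V8 and V9 as NA: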

str(DT) 

Classes ‘data.table’ and 'data.frame': 4 obs. of 11 variables: 
$ V1 : int 2012 2012 2012 2012 
$ V2 : int 276 276 276 276 
$ V3 : int NA 2 2 2 
$ V4 : int 0 8 12 0 
$ V5 : chr "S1" "S1" "S1" "S1" 
$ V6 : chr "001" "001" "001" "011" 
$ V7 : int 1 1 1 8 
$ V8 :Class 'integer64' num [1:4] NA NA NA 1.55e-314 
$ V9 :Class 'integer64' num [1:4] 3.58e-315 NA 2.53e-317 1.55e-314 
$ V10:Class 'integer64' num [1:4] 7.86e-315 7.64e-319 1.08e-316 4.45e-314 
$ V11: int NA 0 NA 0 
- attr(*, ".internal.selfref")=<externalptr> 

...but nothing else sees them as NA:

is.na(DT$V8) 
[1] FALSE FALSE FALSE FALSE 
max(DT$V8) 
integer64 
[1] 9218868437227407266 
> max(DT$V8, na.rm=TRUE) 
integer64 
[1] 9218868437227407266 
> class(DT$V8) 
[1] "integer64" 
> typeof(DT$V8) 
[1] "double" 

It does not appear to be a print-only or screen-only problem; data.table treats them as huge integers:

DT[, V12:=as.numeric(V8)] 
Warning message: 
In as.double.integer64(V8) : 
    integer precision lost while converting to double 
> DT 
    V1 V2 V3 V4 V5 V6 V7     V8     V9  V10 V11   V12 
1: 2012 276 NA 0 S1 001 1 9218868437227407266   724135215 1590915056 NA 9.218868e+18 
2: 2012 276 2 8 S1 001 1 9218868437227407266 9218868437227407266  154598 0 9.218868e+18 
3: 2012 276 2 12 S1 001 1 9218868437227407266    5118863 21819477 NA 9.218868e+18 
4: 2012 276 2 0 S1 011 8   3127133583   3127133583 9003982501 0 3.127134e+09 

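Am I missing something about integer64, or is this a bug? As noted above, I can work around it with integer64="double", possibly losing some precision as mentioned in the help file. But the unexpected behavior occurs with the default, integer64="integer64"...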
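For reference, the integer64="double" workaround mentioned above can be sketched as follows. This is only an illustration, not the original session: it assumes a data.table version whose fread() accepts the `text=` argument (newer than the 1.8.x builds used here), and it passes the four sample rows inline instead of reading integer64_and_NA.csv.

```r
library(data.table)

# The four sample rows from the question, passed inline.
# Assumption: fread() supports `text=`, which the 1.8.x versions did not.
rows <- c(
  '2012,276,,0,"S1","001",1,,724135215,1590915056,',
  '2012,276,2,8,"S1","001",1, ,,154598,0',
  '2012,276,2,12,"S1","001",1,NA,5118863,21819477,',
  '2012,276,2,0,"S1","011",8,3127133583,3127133583,9003982501,0'
)

# Read large integers as doubles so that missing fields become ordinary NA_real_.
DT <- fread(text = paste(rows, collapse = "\n"),
            sep = ",", header = FALSE, integer64 = "double")

# V8 is now a plain double column; is.na() and na.rm behave as usual.
print(is.na(DT$V8))
print(max(DT$V8, na.rm = TRUE))
```

The trade-off is the one the help file describes: doubles represent integers exactly only up to 2^53, so this is safe for values like the ones in this file but not for arbitrary 64-bit IDs.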

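This was done on a Windows 8.1 64-bit machine running Revolution R 3.0.2, and on a virtual machine running Kubuntu 13.10 with CRAN R 3.0.2. On Windows I tried both the latest stable data.table from CRAN (1.8.10 as of Feb 7, 2014) and 1.8.11 (rev. 1110, 2014-02-04 02:43:19, installed manually from the zip because the R-Forge build was broken); on Linux, only the stable 1.8.10. bit64 is installed and loaded on both machines.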

> sessionInfo() 
R version 3.0.2 (2013-09-25) 
Platform: x86_64-w64-mingw32/x64 (64-bit) 

locale: 
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C       
[5] LC_TIME=English_United States.1252  

attached base packages: 
[1] grid  stats  graphics grDevices utils  datasets methods base  

other attached packages: 
[1] bit64_0.9-3  bit_1.1-11  gdata_2.13.2  xts_0.9-7   zoo_1.7-10  nlme_3.1-113  hexbin_1.26.3  lattice_0.20-24 ggplot2_0.9.3.1 
[10] plyr_1.8   reshape2_1.2.2 data.table_1.8.11 Revobase_7.0.0 RevoMods_7.0.0 RevoScaleR_7.0.0 

loaded via a namespace (and not attached): 
[1] codetools_0.2-8 colorspace_1.2-4 dichromat_2.0-0 digest_0.6.4  foreach_1.4.1  gtable_0.1.2  gtools_3.2.1  iterators_1.0.6 
[9] labeling_0.2  MASS_7.3-29  munsell_0.4.2  proto_0.3-10  RColorBrewer_1.0-5 reshape_0.8.4  scales_0.2.3  stringr_0.6.2  
[17] tools_3.0.2  
+0

From the help page: "This feature is still under development". So I would expect the authors to treat this as a bug. –

+0

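Disappointed that this hasn't been addressed yet; it makes the bit64 package useless with data.tables if there are any missing values. I think the problem must be with 'fread', because I can't find any way to force the bit64 package to produce that value. It has a perfectly valid NA value; 'as.integer64(NA)#' – ClaytonJY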

+0

See the related bug, https://github.com/Rdatatable/data.table/issues/488 –

Answers

3

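This is apparently an issue with the bit64 package, not with fread() or data.table. From the bit64 documentation, http://cran.r-project.org/web/packages/bit64/bit64.pdf:

"Subscripting non-existing elements and subscripting with NAs is currently not supported. Such subscripting currently returns 9218868437227407266 instead of NA (the NA value of the underlying double code). Following the full R behaviour here would either destroy performance or require extensive C-coding."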
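To make "the NA value of the underlying double code" concrete: 9218868437227407266 is what you get when the eight bytes of R's double NA are read as one signed 64-bit integer. A small base-R sketch (byte values as expected on the usual x86-64 builds; the variable name is just for illustration):

```r
# Big-endian byte layout of R's double NA (a NaN carrying the payload 1954).
na_bytes <- writeBin(NA_real_, raw(), endian = "big")
print(na_bytes)

# Read as one signed 64-bit integer, the bytes 7f f0 00 00 00 00 07 a2 are
#   0x7FF0000000000000 + 0x7A2 = 9218868437227405312 + 1954
#                              = 9218868437227407266
# which is exactly the value shown in V8 and V9 of the question.
```

So the huge number is a double NA sitting in a column that is interpreted as integer64, which has its own, different NA representation.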

I tried reassigning the 9218868437227407266 values to NA, thinking that would work:

DT[V8==9218868437227407266, ] 
#actually returns nothing, but 
DT[V8==max(V8), ] 
#returns the rows with 9218868437227407266 in V8 
#but this does not reassign the value 
DT[V8==max(V8), V8:=NA] 
#not that this makes sense, but I tried just in case... 
DT[V8==max(V8), V8:=NA_character_] 

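So, as the documentation states fairly clearly, a vector of class integer64 will not recognize NA or missing values here. I am going to avoid bit64 just so I don't have to deal with this...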

+0

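Thanks. I don't *need* (or use) big integers, and I wasn't aware of this limitation. I guess I won't use bit64 until this is resolved; in my case, NAs are far more frequent than large integers. For the record, 'DT[as.character(V8)=="9218868437227407266"]' also returns the rows with the large value (i.e. the NAs). Moreover, 'DT[as.character(V8)=="9218868437227407266", V8:=as.integer64(NA)]' seems to do the job. – Peter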
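The replacement from the comment above can be sketched on a bare integer64 vector, without data.table. The vector `x` here is made up for illustration; only the sentinel value and the as.character() / as.integer64(NA) approach come from the thread.

```r
library(bit64)

# A hypothetical integer64 vector containing the mis-read sentinel.
# The sentinel is parsed from a string because a double literal cannot hold it exactly.
x <- c(as.integer64("9218868437227407266"), as.integer64("3127133583"))

# Match on the character representation, then assign a proper integer64 NA.
x[as.character(x) == "9218868437227407266"] <- as.integer64(NA)

print(is.na(x))
print(max(x, na.rm = TRUE))
```

Comparing through as.character() sidesteps the precision loss that made 'DT[V8==9218868437227407266, ]' return nothing in the answer above.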

6

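This bug, #488, is now fixed with this commit in the development version of data.table, v1.9.5, and the values will be correctly assigned (and displayed) as NA if bit64 is loaded: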

require(data.table) # v1.9.5 
require(bit64) 
ans = fread("test.csv") 
#  V1 V2 V3 V4 V5 V6 V7   V8   V9  V10 V11 
# 1: 2012 276 NA 0 S1 001 1   NA 724135215 1590915056 NA 
# 2: 2012 276 2 8 S1 001 1   NA   NA  154598 0 
# 3: 2012 276 2 12 S1 001 1   NA 5118863 21819477 NA 
# 4: 2012 276 2 0 S1 011 8 3127133583 3127133583 9003982501 0 
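A quick way to confirm the fix on your own installation might look like the sketch below. It is not part of the original answer: the two-row input and the column names `a` and `b` are invented, and it assumes a data.table recent enough to include the fix and to accept `text=` in fread().

```r
library(data.table)  # assumption: a version containing the fix for #488
library(bit64)

# Column b holds one empty field and one value too large for a 32-bit integer,
# so fread() should type it as integer64.
DT <- fread(text = "a,b\n1,\n2,3127133583\n")

# With the fix, the empty field is a real integer64 NA rather than
# 9218868437227407266.
print(class(DT$b))
print(is.na(DT$b))
```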
+1

Thanks for the fix and for following up here and on GitHub. – Peter