2017-01-11

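R data.table fread() not bringing in entire text file exactly

I am currently importing a large data set into R, and I found that fread() from data.table can read it in a reasonable time (read.csv is really slow for me).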

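I have run into a couple of issues that I would like some insight on. There is a "??" marker in front of one column name, which I can fix quickly with a rename statement, but the values in that column are completely different from those in the original file. Each value should be a 16-digit identification code (like "1100110011001100"), but once it is read in it shows up as "3.598E-310".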

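I don't know whether this is caused by the UTF-8 encoding of my data, and I am having trouble figuring out what is going on. There is another variable with similar characteristics (a 12-digit code) that is also shown in exponential notation. All of my other variables look fine (apart from any other variables of the same length as the two that are read in incorrectly).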

Answers

1

You should have received a friendly warning:

library(data.table) #1.10.0 

DT <- fread("1100110011001100 
     1100110011001100") 
#Warning message: 
#In fread("1100110011001100\n  1100110011001100") : 
# Some columns have been read as type 'integer64' but package bit64 isn't loaded. Those columns will display as strange looking floating point data. There is no need to reload the data. Just require(bit64) to obtain the integer64 print method and print the data again. 

print(DT) 
#    V1 
#1: 5.435266e-309 
#2: 5.435266e-309 
#Warning message: 
#In print.data.table(DT) : 
# Some columns have been read as type 'integer64' but package bit64 isn't loaded. Those columns will display as strange looking floating point data. There is no need to reload the data. Just require(bit64) to obtain the integer64 print method and print the data again. 

library(bit64) 
print(DT) 
#     V1 
#1: 1100110011001100 
#2: 1100110011001100 
1

If I understand correctly, the OP's 16-digit identification code is meant to be of type character.

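However, fread() determines the column types from a sample of rows (see ?fread for details). Apparently, it tried to read the data as integer64. The colClasses parameter can be used to override the guesses made by fread():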

DT <- fread("1100110011001100 
     1100110011001100", colClasses = "character") 
DT 
#     V1 
#1: 1100110011001100 
#2: 1100110011001100 
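
For a file with several columns, it may be preferable to override only the affected ones. The following is a minimal sketch, assuming a comma-separated input whose column names `id16`, `code12` and `value` are invented for illustration (the OP's real column names are not known). It passes a named character vector to colClasses so that the remaining columns are still auto-detected.

```r
library(data.table)

# Hypothetical input: the column names id16, code12 and value are illustrative only.
input <- "id16,code12,value\n1100110011001100,123456789012,1.5\n1100110011001101,123456789013,2.5\n"

# Force only the two code columns to character; value is still auto-detected.
DT <- fread(input, colClasses = c(id16 = "character", code12 = "character"))

# Check the class of every column.
sapply(DT, class)
```

With a real file, the path would replace the `input` string and the names would have to match the actual header.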

If the verbose parameter is set to TRUE, fread() reveals some of its internal workings:

DT <- fread("1100110011001100 
     1100110011001100", colClasses = "character", verbose = TRUE) 
Input contains a \n (or is ""). Taking this to be text input (not a filename) 
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard. 
Positioned on line 1 after skip or autostart 
This line is the autostart and not blank so searching up for the last non-blank ... line 1 
Detecting sep ... Deducing this is a single column input. 
Starting data input on line 1 (either column names or first row of data). First 10 characters: 1100110011 
Some fields on line 1 are not type character (or are empty). Treating as a data row and using default column names. 
Count of eol: 2 (including 0 at the end) 
ncol==1 so sep count ignored 
Type codes (point 0): 2 
Column 1 ('V1') was detected as type 'integer64' but bumped to 'character' as requested by colClasses 
Type codes: 4 (after applying colClasses and integer64) 
Type codes: 4 (after applying drop or select (if supplied) 
Allocating 1 column slots (1 - 0 dropped) 
Read 2 rows. Exactly what was estimated and allocated up front 
    0.000s ( 0%) Memory map (rerun may be quicker) 
    0.000s ( 0%) sep and header detection 
    0.000s ( 0%) Count rows (wc -l) 
    0.000s ( 0%) Column type detection (100 rows at 10 points) 
    0.000s ( 0%) Allocation of 2x1 result (xMB) in RAM 
    0.000s ( 0%) Reading data 
    0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered 
    0.000s ( 0%) Coercing data already read in type bumps (if any) 
    0.000s ( 0%) Changing na.strings to NA 
    0.001s  Total 

This may help to analyse the problem with the variable that holds the 12-digit code.
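
One possible way to handle both the 16-digit and the 12-digit codes at once is fread()'s integer64 argument, which controls how columns detected as integer64 are returned. The sketch below is only an illustration under assumptions: the input string and the column names `id16` and `code12` are hypothetical stand-ins for the real file.

```r
library(data.table)

# Hypothetical two-column input standing in for the real file.
input <- "id16,code12\n1100110011001100,123456789012\n1100110011001101,123456789013\n"

# Ask fread() to return any column it detects as integer64 as character instead.
DT <- fread(input, integer64 = "character")

# Inspect the resulting column classes to confirm nothing is left as integer64.
sapply(DT, class)
```

Compared with colClasses, this does not require knowing the column names in advance, but it applies to every column that would otherwise be read as integer64.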