Reading large fixed-format text files in R

I am trying to read a large (> 70 MB) fixed-format text file into R. For smaller files (< 1 MB), I can use the read.fwf() function as shown below.

condodattest1a <- read.fwf(impfile1,widths=testcsv3$Varlen,col.names=testcsv3$Varname)

When I try to run the following line of code,

condodattest1 <- read.fwf(impfile,widths=testcsv3$Varlen,col.names=testcsv3$Varname)

I get the following error message:

Error: cannot allocate vector of size 2 Kb

The only difference between the two lines is the size of the input file.

The format of the file I want to import is given in a data frame named testcsv3. A small snippet of that data frame is shown below:

> head(testcsv3)
  Varlen Varname    Varclass Varsep Varforfmt
1      2    "V1" "character"      2    "A2.0"
2     15    "V2" "character"     17   "A15.0"
3     28    "V3" "character"     45   "A28.0"
4      3    "V4" "character"     48    "F3.0"
5      1    "V5" "character"     49    "A1.0"
6      3    "V6" "character"     52    "A3.0"

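At least part of my problem is that all of the data are being read in as factors when I use read.fwf(), and I end up exceeding the memory limit on my computer.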
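One way to keep read.fwf() from building factors is to pass colClasses (and stringsAsFactors = FALSE) through to read.table(), which read.fwf() calls internally. The sketch below uses a small temporary file with made-up widths and names standing in for impfile and testcsv3. It only shows that the columns arrive as character; it cannot show whether this alone is enough for a 70 MB file.

```r
# Hypothetical stand-ins for impfile and testcsv3 (three short fields).
widths <- c(2, 5, 3)
vnames <- c("V1", "V2", "V3")
tmp <- tempfile(fileext = ".txt")
writeLines(c("NYAlpha123", "CAGamma045"), tmp)

# colClasses is forwarded to read.table(), so every column is read as
# character and no factor levels are built.
dat <- read.fwf(tmp, widths = widths, col.names = vnames,
                colClasses = "character", stringsAsFactors = FALSE)
str(dat)
unlink(tmp)
```

read.fwf() also has a buffersize argument (the maximum number of lines read in one pass), which may be worth lowering if memory is tight.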

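I tried to use read.table() as a way to set the type of each variable, but that function seems to need a text delimiter. Section 3.3 of the page linked below suggests that I could use sep to identify the column where each variable starts.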

http://data.princeton.edu/R/readingData.html

However, when I use the following command:

condodattest1b <- read.table(impfile1,sep=testcsv3$Varsep,col.names=testcsv3$Varname, colClasses=testcsv3$Varclass)

I receive the following error message:

Error in read.table(impfile1, sep = testcsv3$Varsep, col.names = testcsv3$Varname, : invalid 'sep' argument
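read.table() expects sep to be a single delimiter character, so a vector of column positions is rejected. In the head(testcsv3) snippet, Varsep looks like the running total of Varlen (that is, the last column of each field), so the widths that read.fwf() needs can be recovered from it. A small check using only the six rows shown:

```r
# Values copied from the head(testcsv3) output shown earlier.
Varlen <- c(2, 15, 28, 3, 1, 3)
Varsep <- c(2, 17, 45, 48, 49, 52)

# Varsep is the cumulative sum of the field widths (end column of each field).
identical(cumsum(Varlen), Varsep)

# Start columns and widths derived from the end columns.
start  <- c(1, head(Varsep, -1) + 1)
widths <- diff(c(0, Varsep))
```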

Finally, I tried:

condodattest1c <- read.fortran(impfile1,lengths=testcsv3$Varlen, format=testcsv3$Varforfmt, col.names=testcsv3$Varname)

but I got the following messages:

Error in processFormat(format) : missing lengths for some fields 
In addition: Warning messages: 
1: In processFormat(format) : NAs introduced by coercion 
2: In processFormat(format) : NAs introduced by coercion 
3: In processFormat(format) : NAs introduced by coercion
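The three coercion warnings would be consistent with the format strings not matching the pattern read.fortran() expects, for example if the values in Varforfmt literally contain the double quotes shown in the head() output. That is a guess from the printed snippet, not something the error message confirms. Note also that read.fortran() takes the field widths from the format strings and has no lengths argument of its own. The sketch below uses hypothetical data, strips embedded quotes, and drops the .0 from A fields so they follow the plain A<width> form described in the read.fortran() documentation:

```r
# Hypothetical format strings that carry literal quote characters.
fmt_raw <- c("\"A2.0\"", "\"A5.0\"", "\"F3.0\"")

fmt <- gsub("\"", "", fmt_raw, fixed = TRUE)   # remove embedded quotes
fmt <- sub("^(A[0-9]+)\\.0$", "\\1", fmt)      # "A2.0" -> "A2"

tmp <- tempfile(fileext = ".txt")
writeLines(c("NYAlpha123", "CAGamma045"), tmp)

# Widths come from the format strings; as.is = TRUE (the default) keeps
# character fields as character rather than factor.
dat <- read.fortran(tmp, format = fmt, col.names = c("V1", "V2", "V3"))
str(dat)
unlink(tmp)
```

Since read.fortran() is built on read.fwf(), fixing the format strings may not by itself resolve the memory problem on the large file.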

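All I want to do at this point is have the data come into R as something other than factors. I hope this will limit the amount of memory I use and allow me to actually read the file. I would appreciate any suggestions on how to do this. I know the Fortran format of every variable and the column where each variable starts.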

Thanks,

Warren

Source

2014-02-11 Warren Chrusciel

Take a look at the [ff package](http://cran.r-project.org/web/packages/ff/index.html). Or maybe it is worth creating a database and accessing the data with RODBC. – Barranka

Take a look at mnel's (recent) answer [here](http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r) –

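Maybe this code works for you. You have to fill varlen with the field widths and put the corresponding type strings (e.g. numeric, character, integer) in colclasses.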

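my.readfwf <- function(filename, varlen, colclasses) {
    # Last and first character position of each field
    eidx <- cumsum(varlen)
    sidx <- eidx - varlen + 1
    # Read every record as one string, without any quote processing
    filecontent <- readLines(filename)
    if (any(diff(nchar(filecontent)) != 0))
        stop("line lengths differ!")
    nlines <- length(filecontent)
    res <- vector("list", length(varlen))
    for (i in seq_along(varlen)) {
        # substring() is vectorized; sapply() would also attach every full
        # line as an element name, which wastes memory
        res[[i]] <- substring(filecontent, sidx[i], eidx[i])
        # Convert the field to the requested type (as.numeric, as.integer, ...)
        mode(res[[i]]) <- colclasses[i]
    }
    attributes(res) <- list(names = paste("V", seq_along(res), sep = ""),
                            row.names = seq_len(nlines),
                            class = "data.frame")
    return(res)
}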
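For files that are too large to hold comfortably as one character vector, the same substring() idea can be applied to blocks of lines read from an open connection. The following is a minimal sketch and not part of the original answer: the function name read_fwf_chunked and the chunk_size argument are made up for illustration, and the demo file is a tiny temporary stand-in.

```r
read_fwf_chunked <- function(filename, varlen, colclasses, chunk_size = 10000) {
    eidx <- cumsum(varlen)          # last column of each field
    sidx <- eidx - varlen + 1       # first column of each field
    con <- file(filename, open = "r")
    on.exit(close(con))
    chunks <- list()
    repeat {
        lines <- readLines(con, n = chunk_size)
        if (length(lines) == 0) break
        # Cut every field out of the block and convert it to its target type.
        cols <- lapply(seq_along(varlen), function(i) {
            x <- substring(lines, sidx[i], eidx[i])
            mode(x) <- colclasses[i]
            x
        })
        names(cols) <- paste0("V", seq_along(cols))
        chunks[[length(chunks) + 1]] <-
            as.data.frame(cols, stringsAsFactors = FALSE)
    }
    do.call(rbind, chunks)
}

# Tiny demonstration with hypothetical data and widths.
tmp <- tempfile(fileext = ".txt")
writeLines(c("NYAlpha123", "CAGamma045", "TXDelta007"), tmp)
dat <- read_fwf_chunked(tmp, varlen = c(2, 5, 3),
                        colclasses = c("character", "character", "integer"),
                        chunk_size = 2)
str(dat)
unlink(tmp)
```

Only chunk_size raw lines are held at any one time, but the assembled data frame still has to fit in memory, so this helps with the intermediate copies rather than with the final size.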

Source

2014-02-11 21:51:54 Georg
