2013-07-12

Efficiently reading data with multiple separator lines in R

I have a sample dataset like this:

8 02-Model (Minimum) 
250.04167175293 17.4996566772461 
250.08332824707 17.5000038146973 
250.125 17.5008907318115 
250.16667175293 17.5011672973633 
250.20832824707 17.5013771057129 
250.25 17.502140045166 
250.29167175293 17.5025615692139 
250.33332824707 17.5016822814941 
7 03 (Maximum) 
250.04167175293 17.5020561218262 
250.08332824707 17.501148223877 
250.125 17.501127243042 
250.16667175293 17.5
250.20832824707 17.5016021728516 
250.25 17.5024681091309 
250.29167175293 17.5043239593506 

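The first field of each header line gives the number of data rows for that series (here, 8 rows for 02-Model (Minimum)). After those 8 rows there is another header line, 7 03 (Maximum), which means that 03 (Maximum) has 7 rows of data.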

The function I wrote is as follows:

readts <- function(x) 
{ 
    path <- x 
    # Read the first line of the file 
    hello1 <- read.table(path, header = F, nrows = 1,sep="\t") 
    tmp1 <- hello1$V1 
    # Read the data below first line 
    hello2 <- read.table(path, header = F, nrows = (tmp1), skip = 1, 
         col.names = c("Time", "value")) 
    hello2$name <- c(as.character(hello1$V2)) 
    # Read data for the second chunk 
    hello3 <- read.table(path, header = F, skip = (tmp1 + 1), 
         nrows = 1,sep="\t") 
    tmp2 <- hello3$V1 
    hello4 <- read.table(path, header = F, skip = (tmp1 + 2), 
         col.names = c("Time", "value"),nrows=tmp2) 
    hello4$name <- c(as.character(hello3$V2)) 
    # Combine data to create a dataframe 
    df <- rbind(hello2, hello4) 
    return(df) 
} 

The output I get looks like this:

> readts("jdtrial.txt") 
     Time value    name 
1 250.0417 17.49966 02-Model (Minimum) 
2 250.0833 17.50000 02-Model (Minimum) 
3 250.1250 17.50089 02-Model (Minimum) 
4 250.1667 17.50117 02-Model (Minimum) 
5 250.2083 17.50138 02-Model (Minimum) 
6 250.2500 17.50214 02-Model (Minimum) 
7 250.2917 17.50256 02-Model (Minimum) 
8 250.3333 17.50168 02-Model (Minimum) 
9 250.0417 17.50206  03 (Maximum) 
10 250.0833 17.50115  03 (Maximum) 
11 250.1250 17.50113  03 (Maximum) 
12 250.1667 17.50124  03 (Maximum) 
13 250.2083 17.50160  03 (Maximum) 
14 250.2500 17.50247  03 (Maximum) 
15 250.2917 17.50432  03 (Maximum) 

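jdtrial.txt is the data shown above. However, when I have a large file with many separator lines, my function does not work: I have to add more code for every extra block, which makes the function messier. Is there a simpler way to read a data file like this? Thanks.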

The expected output has the same form as the output I get above. Here is data you can try:

8 02-Model (Minimum) 
250.04167175293 17.4996566772461 
250.08332824707 17.5000038146973 
250.125 17.5008907318115 
250.16667175293 17.5011672973633 
250.20832824707 17.5013771057129 
250.25 17.502140045166 
250.29167175293 17.5025615692139 
250.33332824707 17.5016822814941 
7 03 (Maximum) 
250.04167175293 17.5020561218262 
250.08332824707 17.501148223877 
250.125 17.501127243042 
250.16667175293 17.5
250.20832824707 17.5016021728516 
250.25 17.5024681091309 
250.29167175293 17.5043239593506 
8 04-Model (Maximum) 
250.04167175293 17.5020561218262 
250.08332824707 17.501148223877 
250.125 17.501127243042 
250.16667175293 17.5
250.20832824707 17.5016021728516 
250.25 17.5024681091309 
250.29167175293 17.5043239593506 
250.33332824707 17.5055828094482 

Answers

3

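It's not clear what "multiple separators" refers to, but here is a solution that addresses the data you actually showed.

Read the data with fill=TRUE so that empty fields are filled in. Use is.hdr to track which rows are headers. Convert V2 to numeric (replacing V2 with NA in the header rows so that they don't generate warnings). Then, for the next two columns, replace the non-header rows with NAs and use na.locf (link) to fill those NAs with the header values. Finally, keep only the non-header rows.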

library(zoo) 
DF <- read.table("jdtrial.txt", fill = TRUE, as.is = TRUE) 

is.hdr <- DF$V3 != "" 
transform(DF, 
    V2 = as.numeric(replace(V2, is.hdr, NA)), 
    V3 = na.locf(ifelse(is.hdr, V2, NA)), 
    name = na.locf(ifelse(is.hdr, V3, NA)))[!is.hdr, ] 

The result of the last statement is:

  V1  V2  V3  name 
2 250.0417 17.49966 02-Model (Minimum) 
3 250.0833 17.50000 02-Model (Minimum) 
4 250.1250 17.50089 02-Model (Minimum) 
5 250.1667 17.50117 02-Model (Minimum) 
6 250.2083 17.50138 02-Model (Minimum) 
7 250.2500 17.50214 02-Model (Minimum) 
8 250.2917 17.50256 02-Model (Minimum) 
9 250.3333 17.50168 02-Model (Minimum) 
11 250.0417 17.50206  03 (Maximum) 
12 250.0833 17.50115  03 (Maximum) 
13 250.1250 17.50113  03 (Maximum) 
14 250.1667 17.50124  03 (Maximum) 
15 250.2083 17.50160  03 (Maximum) 
16 250.2500 17.50247  03 (Maximum) 
17 250.2917 17.50432  03 (Maximum) 
19 250.0417 17.50206 04-Model (Maximum) 
20 250.0833 17.50115 04-Model (Maximum) 
21 250.1250 17.50113 04-Model (Maximum) 
22 250.1667 17.50124 04-Model (Maximum) 
23 250.2083 17.50160 04-Model (Maximum) 
24 250.2500 17.50247 04-Model (Maximum) 
25 250.2917 17.50432 04-Model (Maximum) 
26 250.3333 17.50558 04-Model (Maximum) 
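
For readers who do not have zoo installed, the carry-forward step can also be sketched in base R. This is a swapped-in technique, not the code from the answer: cumsum(is.hdr) gives each row the index of the most recent header, which plays the role of na.locf. The sample text and the names txt, hdr.names, block and out are made up for illustration, and the sketch assumes every header name splits into exactly two whitespace-separated tokens, as in the data shown.

```r
# Small in-memory sample in the same layout as the question's file (illustrative)
txt <- "2 02-Model (Minimum)
250.04167175293 17.4996566772461
250.08332824707 17.5000038146973
1 03 (Maximum)
250.04167175293 17.5020561218262"

DF <- read.table(text = txt, fill = TRUE, as.is = TRUE)

# Header rows are the only ones with a non-empty third field
is.hdr <- DF$V3 != ""

# Full header text for each header row, e.g. "02-Model (Minimum)"
hdr.names <- paste(DF$V2[is.hdr], DF$V3[is.hdr])

# cumsum(is.hdr) numbers every row by the header block it belongs to,
# which carries the most recent header forward without zoo
block <- cumsum(is.hdr)

out <- data.frame(
  Time  = DF$V1[!is.hdr],
  value = as.numeric(DF$V2[!is.hdr]),
  name  = hdr.names[block[!is.hdr]]
)
out
```

Unlike the transform() call, this keeps the whole series name in a single column; which layout is preferable depends on what the data are used for afterwards.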
+0

Nice. This is the best option so far. Everyone overlooked the 'fill = TRUE' argument. – thelatemail

+0

It looks short and handy, but I'm not familiar with the zoo package. Your explanation is helpful, though. –

1

Here is a function that seems to work for your sample data. It returns a list of data.frames, but you can use do.call(rbind, ...) to get a single data.frame if you prefer.

myFun <- function(textfile) { 
    # Read the lines of your text file 
    x <- readLines(textfile) 
    # Identify lines that start with space followed 
    # by numbers followed by space followed by 
    # numbers. By the looks of it, matching the 
    # space at the start of the line might be 
    # sufficient at this stage. 
    myMatch <- grep("^\\s[0-9]+\\s+[0-9]+", x) 
    # Extract the first number, which tells us how 
    # many values need to be read in. 
    scanVals <- as.numeric(gsub("^\\s+([0-9]+)\\s+.*", 
           "\\1", x[myMatch])) 
    # Extract. I've used seq_along which is like 
    # 1:length(myMatch) 
    temp <- lapply(seq_along(myMatch), function(y) { 
    # scan will return just a single vector, but your 
    # data are in pairs, so we convert the vector to 
    # a matrix filled in by row 
    t1 <- matrix(scan(textfile, skip = myMatch[y], 
         n = scanVals[y]*2), ncol = 2, 
       byrow = TRUE) 
    # Add column names to the matrix 
    colnames(t1) <- c("time", "value") 
    # Convert the matrix to a data.frame and add the 
    # name column using cbind. 
    cbind(data.frame(t1), 
      name = gsub("^\\s+([0-9]+)\\s+(.*)", "\\2", 
         x[myMatch])[y]) 
    }) 
    # Return the list we just created 
    temp 
} 

Example usage would be:

myFun("mytest.txt")     ## list output 

do.call(rbind, myFun("mytest.txt")) ## Single data.frame 
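
To see what the do.call(rbind, ...) step does on its own, here is a minimal sketch with a hand-built list. The data frames and the names parts and combined are invented for illustration and are not output of myFun.

```r
# A list of data.frames with identical columns, like the list myFun returns
parts <- list(
  data.frame(time = c(250.04, 250.08), value = c(17.50, 17.51), name = "A"),
  data.frame(time = 250.04, value = 17.52, name = "B")
)

# do.call passes the list elements as separate arguments,
# so this is equivalent to rbind(parts[[1]], parts[[2]])
combined <- do.call(rbind, parts)
combined
```

This works for any number of list elements, which is why it is convenient when the number of blocks in the file is not known in advance.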
+0

Thank you so much. It works perfectly, but I would like to understand it. –

+0

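@Jdbaba, see my edit. If it's still unclear, let me know. '> readFile("trial2.txt") Read 1 item Error in seq_len(nlines): argument must be coercible to non-negative integer In addition: Warning message: – A5C1D2H2I1M1N2O1R2T1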

1

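Read the data in with readLines, then step through each chunk of data in sequence. This avoids making assumptions about the model names or fiddling with regular expressions. You have to use a loop rather than [sl]apply, but there is really nothing wrong with that.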

readFile <- function(file)
{
    con <- readLines(file)
    i <- 1
    chunks <- list()
    while(i < length(con))
    {
     # Header line: row count and series name, separated by a tab
     type <- scan(text=con[i], what=character(2), sep="\t")
     nlines <- as.numeric(type[1])
     # Parse the next nlines lines as data and tag them with the series name
     dat <- cbind(read.delim(text=con[i+seq_len(nlines)], header=FALSE),
        type=type[2])
     chunks <- c(chunks, list(dat))
     # Jump to the next header line
     i <- i + nlines + 1
    }
    do.call(rbind, chunks)
}
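
The loop above assumes tab-separated input. If the file is separated by spaces instead, one possible variation is to split the header on whitespace. The sketch below is only an illustration under that assumption: readChunks, its lines argument and the sample vector are hypothetical names, and it takes a character vector of lines (for example from readLines) rather than a file name.

```r
readChunks <- function(lines) {
  chunks <- list()
  i <- 1
  while (i <= length(lines)) {
    hdr <- trimws(lines[i])
    # The leading integer is the row count; the rest of the line is the name
    n <- suppressWarnings(as.integer(sub("^([0-9]+)\\s+.*$", "\\1", hdr)))
    if (is.na(n)) stop("line ", i, " is not a header line: ", hdr)
    name <- sub("^[0-9]+\\s+", "", hdr)
    if (n > 0) {
      # read.table splits on any whitespace by default
      dat <- read.table(text = lines[i + seq_len(n)],
                        col.names = c("Time", "value"))
      dat$name <- name
      chunks[[length(chunks) + 1]] <- dat
    }
    # Jump to the next header line
    i <- i + n + 1
  }
  do.call(rbind, chunks)
}

# Illustrative input; in practice this would be readLines("jdtrial.txt")
sample_lines <- c(" 2 02-Model (Minimum)",
                  "250.04167175293 17.4996566772461",
                  "250.08332824707 17.5000038146973",
                  "1 03 (Maximum)",
                  "250.04167175293 17.5020561218262")
res <- readChunks(sample_lines)
res
```

The header is still taken at its word: if a declared row count does not match the file, the function stops at the first line that does not start with an integer rather than guessing.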
+0

I get an error when using your function ... NAs introduced by coercion' –

+0

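I assumed your data has tab separators, as your post implies. Are you running it on input where the tabs have been converted to spaces (for example, by cutting and pasting to/from SO)? –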

1

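Edited to replace my original answer in light of @G.Grothendieck's far better answer. This is largely a variation on that answer.

Here is another way to go about it, where, for demonstration purposes, test stands in for the original text: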

test <-" 1 02-Model (Minimum) 
250.04167175293 17.4996566772461 
1 03 (Maximum) 
250.04167175293 17.5020561218262 
1 04-Model (Maximum) 
250.04167175293 17.5020561218262" 

Processing it:

interm <- read.table(
    text = test, fill = TRUE, as.is = TRUE, 
    col.names=c("Time","Value","Name") 
) 

keys <- which(interm$Name != "") 

interm$Name <- rep(
    apply(interm[keys,][-1],1,paste0,collapse=""), 
    diff(c(keys,nrow(interm)+1)) 
) 

result <- interm[-(keys),] 

Result:

 Time   Value    Name 
2 250.0417 17.4996566772461 02-Model(Minimum) 
4 250.0417 17.5020561218262  03(Maximum) 
6 250.0417 17.5020561218262 04-Model(Maximum)
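
The rep(..., diff(c(keys, nrow(interm) + 1))) step is the least obvious part, so here it is in isolation. The numbers below are made up: they describe a hypothetical 8-row table whose header rows sit at positions 1, 4 and 6.

```r
keys <- c(1, 4, 6)  # row positions of the header lines (hypothetical)
n    <- 8           # total number of rows, headers included

# Distance from each header to the next one (or to one past the last row):
# the number of rows each header covers, counting the header row itself
run_lengths <- diff(c(keys, n + 1))
run_lengths
# 3 2 3

# Repeat each header label over the rows it covers
labels <- rep(c("A", "B", "C"), run_lengths)
labels
# "A" "A" "A" "B" "B" "C" "C" "C"
```

Because the header rows receive their own label as well, dropping them afterwards with interm[-(keys),] leaves every data row with the right name.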