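Insert NA into data read as strings

2014-01-10

I have a text file whose rows do not all have the same number of elements. Sometimes the second column contains data, sometimes it contains NA, and sometimes there is no entry at all. I know that if a row has only 4 elements, I should insert an NA as the second element. However, I do not know how to do that. Here is an example data set: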

abc.def ghi.jkl mno pqr A* 
bc.def NA no qr A 
c-e.ef non qrr AE 
fg.gg no qr E 
aa.bb cc.dd ee ff A* 

Here is the desired result:

desired.result <- read.table(text = ' 
    Name1 Name2 Name3 Name4 Status 
abc.def ghi.jkl mno pqr  A* 
bc.def  NA  no qr  A 
c-e.ef  NA non qrr  AE 
    fg.gg  NA  no qr  E 
    aa.bb cc.dd  ee ff  A* 
', header = TRUE) 

I have not gotten far, but I have been able to split the data and put it into a matrix with the following code. Of course, the data are misaligned.

setwd('c:/users/mmiller21/simple R programs') 

my.data <- readLines('name_data.txt') 

matrix(unlist(strsplit(unlist(my.data), " ")), ncol=5, byrow=TRUE) 

#  [,1]  [,2]  [,3] [,4]  [,5]  
# [1,] "abc.def" "ghi.jkl" "mno" "pqr"  "A*"  
# [2,] "bc.def" "NA"  "no" "qr"  "A"  
# [3,] "c-e.ef" "non"  "qrr" "AE"  "fg.gg" 
# [4,] "no"  "qr"  "E" "aa.bb" "cc.dd" 
# [5,] "ee"  "ff"  "A*" "abc.def" "ghi.jkl" 

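Somehow I should count the number of elements in each row after strsplit(unlist(my.data), " "), insert NA as the second element of every row that contains only four elements, and then put the data into a matrix. Thanks for any help. I would prefer base R.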

Answers

2

Replace dat with your file name:

dat <- textConnection("abc.def ghi.jkl mno pqr A* 
bc.def NA no qr A 
c-e.ef non qrr AE 
fg.gg no qr E 
aa.bb cc.dd ee ff A*") 

my.lines <- readLines(dat) 
my.rows <- strsplit(my.lines, " ") 
adjust <- function(row) { 
    if (length(row) == 4) c(head(row, 1), NA, tail(row, 3)) 
    else row 
} 
my.fixed <- lapply(my.rows, adjust) 

out <- matrix(unlist(my.fixed), ncol = 5, byrow = TRUE) 
out[out == "NA"] <- NA 
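
If a data frame with the column names from the desired result is wanted rather than a character matrix, the matrix can be converted afterwards. A minimal sketch, assuming the five column names from the question; the helper name pad_row is made up for illustration and does the same job as adjust above:

```r
# Example input from the question, read through a text connection
con <- textConnection("abc.def ghi.jkl mno pqr A*
bc.def NA no qr A
c-e.ef non qrr AE
fg.gg no qr E
aa.bb cc.dd ee ff A*")
my.lines <- readLines(con)
close(con)

# Split on runs of whitespace so repeated or trailing blanks do not matter
my.rows <- strsplit(trimws(my.lines), "\\s+")

# pad_row is a hypothetical helper: insert NA as the second element
# of any row that has exactly four fields
pad_row <- function(row) {
  if (length(row) == 4) append(row, NA, after = 1) else row
}

out <- do.call(rbind, lapply(my.rows, pad_row))
out[!is.na(out) & out == "NA"] <- NA  # literal "NA" strings become real NA

# Convert to a data frame and attach the assumed column names
result <- as.data.frame(out, stringsAsFactors = FALSE)
names(result) <- c("Name1", "Name2", "Name3", "Name4", "Status")
result
```

Splitting on "\\s+" instead of a single space is a choice made here so that extra blanks in the file do not produce empty fields; it is not part of the answer above.
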
2

You can use the option fill=TRUE and then shift the entries in the rows that came up short:

dat <- read.table(text='abc.def ghi.jkl mno pqr A* 
    bc.def NA no qr A 
c-e.ef non qrr AE 
fg.gg no qr E 
aa.bb cc.dd ee ff A*',fill=TRUE) 

t(apply(dat,1,function(x){ 
    if(nchar(x[5])==0) 
    x= c(x[1],NA_character_,x[2:4]) 
    x 
})) 

    [,1]  [,2]  [,3] [,4] [,5] 
[1,] "abc.def" "ghi.jkl" "mno" "pqr" "A*" 
[2,] "bc.def" NA  "no" "qr" "A" 
[3,] "c-e.ef" NA  "non" "qrr" "AE" 
[4,] "fg.gg" NA  "no" "qr" "E" 
[5,] "aa.bb" "cc.dd" "ee" "ff" "A*" 
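
The test on the fifth column works because fill=TRUE pads short rows on the right, so a row with only four fields ends up with an empty string in the last column. A small sketch for inspecting that intermediate result; the names raw and short are only for illustration, and stringsAsFactors=FALSE is added here so the columns are plain character vectors regardless of R version:

```r
raw <- read.table(text = "abc.def ghi.jkl mno pqr A*
bc.def NA no qr A
c-e.ef non qrr AE
fg.gg no qr E
aa.bb cc.dd ee ff A*", fill = TRUE, stringsAsFactors = FALSE)

# Rows that had only four fields carry an empty string in column 5
short <- raw[[5]] == ""
short

# These are the rows whose entries still need to be shifted right
raw[short, ]
```

Note that the literal NA in the second row is already read as a missing value by read.table, so only the padded rows need further handling.
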
3
dat <- read.table(text="abc.def ghi.jkl mno pqr A* 
bc.def NA no qr A 
c-e.ef non qrr AE 
fg.gg no qr E 
aa.bb cc.dd ee ff A*", fill=TRUE, stringsAsFactors=FALSE) 
names(dat) <- c('Name1' , 'Name2', 'Name3', 'Name4','Status') 
is.na(dat[[5]]) <- dat[[5]]=="" # set blanks in col 5 to NA 
t(apply(dat, 1, function(r) if(is.na(r[5])) {r[c(1,5,2:4)]}else {r})) 
#--------- 
    [,1]  [,2]  [,3] [,4] [,5] 
[1,] "abc.def" "ghi.jkl" "mno" "pqr" "A*" 
[2,] "bc.def" NA  "no" "qr" "A" 
[3,] "c-e.ef" NA  "non" "qrr" "AE" 
[4,] "fg.gg" NA  "no" "qr" "E" 
[5,] "aa.bb" "cc.dd" "ee" "ff" "A*" 
+1

Magic 'is.na(dat[[5]]) <- dat[[5]]==""'! – agstudy

+0

This is equivalent to @agstudy's, except that his allows the last column to contain "NA". – flodel

+0

迪寧, you changed your name! (I have not been on this site for a while.) –

1

A readLines approach: split on whitespace and insert NA into the short rows:

txt <- readLines(file) 
t(sapply(strsplit(txt, "\\s+"), function(x) if(length(x) < 5) append(x, NA, 1) else x)) 
#  [,1]  [,2]  [,3] [,4] [,5] 
# [1,] "abc.def" "ghi.jkl" "mno" "pqr" "A*" 
# [2,] "bc.def" "NA"  "no" "qr" "A" 
# [3,] "c-e.ef" NA  "non" "qrr" "AE" 
# [4,] "fg.gg" NA  "no" "qr" "E" 
# [5,] "aa.bb" "cc.dd" "ee" "ff" "A*" 

Full version, including the data setup:

file <- tempfile() 
cat("abc.def ghi.jkl mno pqr A* 
bc.def NA no qr A 
c-e.ef non qrr AE 
fg.gg no qr E 
aa.bb cc.dd ee ff A*", "\n", sep="", file=file) 
txt <- readLines(file) 
t(sapply(strsplit(txt, "\\s+"), function(x) if(length(x) < 5) append(x, NA, 1) else x)) 
unlink(file) 
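
In the output shown for this approach, the second row still holds the literal string "NA" rather than a missing value, because readLines does no NA conversion. A minimal sketch of one way to convert it afterwards, assuming no field is legitimately the text "NA"; the names txt_lines and m are only for illustration:

```r
txt_lines <- c("abc.def ghi.jkl mno pqr A*",
               "bc.def NA no qr A",
               "c-e.ef non qrr AE",
               "fg.gg no qr E",
               "aa.bb cc.dd ee ff A*")

# append(x, NA, 1) places NA after the first element of a short row
m <- t(sapply(strsplit(txt_lines, "\\s+"),
              function(x) if (length(x) < 5) append(x, NA, 1) else x))

# Turn literal "NA" strings into real missing values;
# which() skips the entries that are already NA
m[which(m == "NA")] <- NA
m[, 2]
```
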

Note that this is similar to @Flodel's answer.