2012-10-12 94 views
2

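Delete lines of a file following 'NA'

My data is arranged in a specific way, with no header, and the columns do not necessarily contain the same type of information. A portion of it can be reproduced with: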

data <- textConnection("rs123,22,337647,C,T
1,7385,0.4156,-0.0019,0.0037
1,16550,0.959163800640972,-0.0241,0.0128
1,17218,0.0528,0.015,0.039
rs193,22,366349,C,T
1,7385,0.3708,0.0017,0.0035
1,16550,0.793259111116741,-0.0028,0.009
1,17218,0.9547,-0.016,0.033
rs194,22,366300,NA,NA
0,0,0,0,0
0,0,0,0,0
0,0,0,0,0
rs118,22,301327,C,T
1,7385,0.0431,-0.0085,0.0077
1,16550,0.789981059331214,0.0036,0.0092
1,17218,0.99,-0.057,0.062
rs120,22,497528,C,G
1,7385,0.0716,0.0012,0.0073
1,16550,0.233548238634496,-0.0033,0.0064
1,17218,0.4563,-0.002,0.015
rs109,22,309825,A,G
1,5520,0.8611,2e-04,0.0044
0,0,0,0,0
1,17218,0.9762,0.076,0.044
rs144,22,490068,C,T
0,0,0,0,0
0,0,0,0,0
1,17218,0.2052,-0.013,0.032")
mydata <- read.csv(data, header = FALSE, sep = ",", stringsAsFactors = FALSE)

My question is: I can write a grep/awk one-liner to remove the lines that contain 'NA' (these are SNPs with no data):

grep -v 'NA' file.in > file.out 

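But how can I specify that the 3 lines that follow are deleted as well? I do not want to delete every line that is all zeros, only the all-zero lines that follow a SNP line containing 'NA'.

Thanks for your advice!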

Answers

3

Using GNU sed (because a line count after an address is an extension):

sed -e '/NA/,+3 d' infile 

Edited to add an awk solution:

awk '/NA/ { for (i = 1; i <= 4; i++) { getline; } } { print }' infile 
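A counter-based variant may be a little more defensive than the getline loop above. With getline, the line read after the 3 skipped lines is printed without being tested for NA, and if the input ends inside the block the last line read is still printed. This is only a sketch: the function name `drop_na_blocks` and the variable `skip` are illustrative, and it assumes every NA line is followed by exactly 3 lines to drop.

```shell
# Sketch: drop each line containing NA plus the 3 lines after it.
# drop_na_blocks and skip are illustrative names, not from the answer above.
drop_na_blocks() {
  awk '/NA/ { skip = 3; next }     # NA line: drop it and arm the counter
       skip > 0 { skip--; next }   # drop the 3 lines that follow
       { print }' "$@"             # everything else passes through
}
```

It can be called as `drop_na_blocks infile > outfile`, or fed from a pipe.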
+0

I would like this to work, but it gives me the error: 'sed: 1: "/NA/,+3 d": expected context address'. I tried double quotes, to no avail. – mfk534

+0

You will need the GNU version of 'sed'. I have edited the answer to add a solution using 'awk'. – Birei

+0

I'm not sure which version of sed I'm using, but the awk line does the job. Thank you very much! – mfk534

1

Update: my previous answer may have been wrong, so here is an alternative:

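# rows that contain at least one NA
nas <- apply(mydata, 1, function(x) any(is.na(x)))
# rows made up entirely of zeros
s <- apply(mydata == 0, 1, all)
out <- which(nas)
for (i in which(nas)) {
    j <- i + 1
    # collect every consecutive all-zero row that follows an NA row
    while (!is.na(s[j]) && s[j]) {
        out <- c(out, j)
        j <- j + 1
    }
}
# guard: mydata[-integer(0), ] would return zero rows
mydata2 <- if (length(out) > 0) mydata[-out, ] else mydata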

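At first I thought you only cared about the first 3 rows after an NA, but it looks like you want to delete all consecutive all-zero rows after each NA.

(This was my previous answer:)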

nas <- apply(mydata, 1, function(x) any(is.na(x))) 
whereToLook <- sort(which(nas) + 1:3) 
s <- apply(mydata == 0, 1, prod) 
zeros <- which(s == 1) 
whereToErase <- zeros[zeros %in% whereToLook] 
whereToErase <- c(which(nas), whereToErase) 
+0

Yes, this solution works. Thank you very much! – mfk534

1

After importing into R, you can do this:

# identify the rows containing any NA's 
narows <- which(apply(mydata,1,function(x) any(is.na(x)))) 
# identify the rows containing all 0's 
zerorows <- which(apply(mydata==0,1,all)) 

# get the rows that either contain NAs, or are all 0 and are 
# within 3 of the NA rows 
rowstodelete <- c(narows,
                  intersect(
                    sapply(narows, function(x) seq(x, x + 3)),
                    zerorows
                  ))

# subset mydata to only remove the NA rows + the following 3 "zero rows" 
mydata[-rowstodelete,] 
+0

This works too. Thanks! – mfk534

0

This might work for you (GNU sed):

sed '/\<NA\>/!b;:a;$!N;s/\n\(0,\)\+0$//;ta;D' file 

This deletes any line containing NA, along with any all-zero (0,...,0) lines that immediately follow it.
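If GNU sed is not available, the same idea can probably be expressed in awk. The sketch below is not the sed command above but a swapped-in equivalent: it drops each NA line and any run of all-zero lines directly after it, and leaves all-zero lines elsewhere alone. The function name `drop_na_and_zero_runs` and the flag `in_run` are illustrative, and it assumes the lines carry no trailing whitespace (otherwise the `^0(,0)*$` pattern will not match).

```shell
# Sketch: drop NA lines and the all-zero lines that directly follow them.
# drop_na_and_zero_runs and in_run are illustrative names.
drop_na_and_zero_runs() {
  awk '/NA/ { in_run = 1; next }        # NA line: drop it, start a run
       in_run && /^0(,0)*$/ { next }    # all-zero line inside the run: drop
       { in_run = 0; print }' "$@"      # any other line ends the run
}
```

Usage would be `drop_na_and_zero_runs file > file.out`. Unlike a fixed count of 3, this follows the run however long it is, which matches the "all consecutive all-zero rows" reading of the question.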