2013-02-15 32 views
3

Reading a file with missing values in R

I have a file whose name is stored in fn, which I read as follows:

age CALCIUM CREATININE GLUCOSE 
64.3573  1.1 488 
69.9043 8.1 1.1 472 
65.6633 8.6 0.8 461 
50.3693 8.1 1.3 418 
57.0334 8.7 0.8 NEG 
81.4939  1.1 NEG 
56.954 9.8 1 
76.9298 9.1 0.8 NEG 


> tmpData = read.table(fn, header = TRUE, sep= "\t" , na.strings = c('', 'NA', '<NA>'), blank.lines.skip = TRUE) 
> tmpData 
     age CALCIUM CREATININE GLUCOSE 
1 64.3573   NA  1.1  488 
2 69.9043   8.1  1.1  472 
3 65.6633   8.6  0.8  461 
4 50.3693   8.1  1.3  418 
5 57.0334   8.7  0.8  NEG 
6 81.4939   NA  1.1  NEG 
7 56.9540   9.8  1.0 <NA> 
8 76.9298   9.1  0.8  NEG 

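The file is read as shown above, with missing values replaced by NA and <NA>. I think the GLUCOSE column is being treated as a factor. Is there a simple way to have <NA> interpreted as a true NA, and to convert any non-numeric value to NA (in this example, NEG converted to NA)?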

+2

What happens if you add "NEG" to na.strings? – joran 2013-02-15 15:30:42

+0

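It works if NEG is included. But in general the string could be any sequence of characters. Is there a way of reading the file that handles this case automatically? – user1140126 2013-02-15 15:36:16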

Answers

4

You can take advantage of the fact that as.numeric coerces non-numeric values to NA. In other words, try something like this:

Here's your data:

temp <- structure(list(age = c(64.3573, 69.9043, 65.6633, 50.3693, 57.0334, 
    81.4939, 56.954, 76.9298), CALCIUM = c(1.1, 8.1, 8.6, 8.1, 8.7, 
    1.1, 9.8, 9.1), CREATININE = c(NA, 1.1, 0.8, 1.3, 0.8, NA, 1, 
    0.8), GLUCOSE = structure(c(5L, 4L, 3L, 2L, 6L, 6L, 1L, 6L), .Label = c("", 
    "418", "461", "472", "488", "NEG"), class = "factor")), .Names = c("age", 
    "CALCIUM", "CREATININE", "GLUCOSE"), class = "data.frame", row.names = c(NA, 
    -8L)) 

And its current structure:

str(temp) 
# 'data.frame': 8 obs. of 4 variables: 
# $ age  : num 64.4 69.9 65.7 50.4 57 ... 
# $ CALCIUM : num 1.1 8.1 8.6 8.1 8.7 1.1 9.8 9.1 
# $ CREATININE: num NA 1.1 0.8 1.3 0.8 NA 1 0.8 
# $ GLUCOSE : Factor w/ 6 levels "","418","461",..: 5 4 3 2 6 6 1 6 

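Now convert the last column to numeric. Because it is a factor, we need to convert it to character first. Note the warning; in this case we are actually happy about it, since it means the non-numeric values were turned into NA.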

temp$GLUCOSE <- as.numeric(as.character(temp$GLUCOSE)) 
# Warning message: 
# NAs introduced by coercion 
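
The intermediate as.character step matters because as.numeric applied directly to a factor returns the internal level codes rather than the labels. A minimal sketch of the difference, using a made-up factor g rather than the data above:

```r
# Hypothetical factor for illustration; not the asker's data
g <- factor(c("488", "472", "NEG", ""))

# Directly on the factor: returns the level codes (positions in levels(g))
codes <- as.numeric(g)

# Via character: parses the labels; "NEG" and "" cannot be parsed and become NA
values <- suppressWarnings(as.numeric(as.character(g)))

print(levels(g))
print(codes)
print(values)
```

Here suppressWarnings is used only to silence the expected "NAs introduced by coercion" message; drop it if you would rather see the warning.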

The result:

temp 
#  age CALCIUM CREATININE GLUCOSE 
# 1 64.3573  1.1   NA  488 
# 2 69.9043  8.1  1.1  472 
# 3 65.6633  8.6  0.8  461 
# 4 50.3693  8.1  1.3  418 
# 5 57.0334  8.7  0.8  NA 
# 6 81.4939  1.1   NA  NA 
# 7 56.9540  9.8  1.0  NA 
# 8 76.9298  9.1  0.8  NA 
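
The same coercion can be applied right after reading the file. The following is only a sketch, assuming a tab-separated file like the one in the question: the inline text in txt stands in for the real file, and stringsAsFactors = FALSE is passed so that GLUCOSE arrives as character and needs no as.character step.

```r
# Inline stand-in for the tab-separated file from the question (an assumption)
txt <- "age\tCALCIUM\tCREATININE\tGLUCOSE
64.3573\t\t1.1\t488
57.0334\t8.7\t0.8\tNEG
56.954\t9.8\t1\t"

dat <- read.table(text = txt, header = TRUE, sep = "\t",
                  na.strings = c("", "NA"), stringsAsFactors = FALSE)

# GLUCOSE is read as character because of "NEG"; coerce it so that
# non-numeric strings become NA, silencing the expected warning
dat$GLUCOSE <- suppressWarnings(as.numeric(dat$GLUCOSE))

str(dat)
```

With a real file, replace text = txt with the file name (fn in the question).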

Just for fun, here's a small function I put together that offers an alternative approach:

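makemeNA <- function (mydf, NAStrings, fixed = TRUE) { 
    if (!isTRUE(fixed)) { 
        # Treat NAStrings as a regular expression: delete every match, so
        # cells made up only of matches become "" and are read as NA below
        mydf[] <- lapply(mydf, function(x) gsub(NAStrings, "", x)) 
        NAStrings <- "" 
    } 
    # Re-detect each column's type, treating NAStrings as NA.
    # as.is = TRUE keeps text columns as character rather than factor.
    mydf[] <- lapply(mydf, function(x) type.convert(
        as.character(x), na.strings = NAStrings, as.is = TRUE)) 
    mydf 
} 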

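This function lets you specify a regular expression to determine what should become an NA value. I haven't really tested it, so use regular expressions at your own risk.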

Using the same "temp" object from above, try these out to see what the function does:

# Change anything that is just text to NA 
makemeNA(temp, "[A-Za-z]", fixed = FALSE) 
# Change any exact matches with "NEG" to NA 
makemeNA(temp, "NEG") 
# Change any matches with 3-digit integers to NA 
makemeNA(temp, "^[0-9]{3}$", fixed = FALSE)
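
For the general case raised in the comments, where the stray strings are not known in advance, it can help to see which values would be lost before coercing. A rough sketch; the helper name non_numeric_values and the sample vector are hypothetical:

```r
# Hypothetical helper: list the distinct non-empty strings in x that
# as.numeric() cannot parse and would therefore turn into NA
non_numeric_values <- function(x) {
  chr <- as.character(x)                       # works for factors too
  num <- suppressWarnings(as.numeric(chr))
  bad <- is.na(num) & !is.na(chr) & nzchar(trimws(chr))
  unique(chr[bad])
}

# Made-up column with two different text markers, a blank and a real NA
glucose <- factor(c("488", "472", "NEG", "", NA, "TRACE"))
found <- non_numeric_values(glucose)
print(found)
```

The strings it reports could then be passed to na.strings in read.table, or simply reviewed before running as.numeric(as.character(...)) on the column.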