I use prodNA from the missForest package.
My function is as follows:
fn.df.add.NA <- function(df, var.name, prop.of.missing) {
  df.buf <- subset(df, select = c(var.name)) # select the variable as a one-column data frame
  # call prodNA through its namespace so the function does not attach or detach missForest
  df.buf <- missForest::prodNA(x = df.buf, noNA = prop.of.missing) # set a random share of values to NA
  df.col.order <- colnames(x = df) # save the column order
  df <- subset(df, select = -c(which(colnames(df) == var.name))) # drop the variable with no NAs
  df <- cbind(df, df.buf) # add the column with NAs
  df <- subset(df, select = df.col.order) # restore the original column order
  return(df)
}
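It replaces a random subset of one variable's observations with NA, according to a given proportion.
Because prodNA applies NAs across all columns of the data.frame it receives, I pass it a one-column "buffer" data frame and then put that column back, so the function returns the same structure as the input data.frame. Perhaps some readers can suggest a more elegant way.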
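One possible simplification, offered only as a sketch: index the column directly in base R so no buffer and no column reordering are needed. The name add_na_base is hypothetical, and rounding the count down with floor() is an assumption about how the count should be derived, not something taken from the function above.

```r
# Hypothetical helper (not part of the original answer): set a random share
# of one column to NA in place, leaving the other columns untouched.
add_na_base <- function(df, var.name, prop.of.missing) {
  n <- nrow(df)
  n.missing <- floor(n * prop.of.missing) # assumed rounding rule: round down
  idx <- sample(n, n.missing)             # random row positions to blank out
  df[[var.name]][idx] <- NA               # only this column is modified
  df
}

set.seed(1)
demo <- data.frame(a = runif(100), b = runif(100))
demo <- add_na_base(demo, "a", 0.25)
colSums(is.na(demo)) # NA count per column
```

Since the column is modified in place, the column order and the other variables stay exactly as they were in the input.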
In any case, you can run this test:
set.seed(1)
df <- data.frame(a = as.numeric(runif(n = 100, min = 1, max = 100)),
                 b = as.numeric(runif(n = 100, min = 201, max = 300)),
                 c = as.numeric(runif(n = 100, min = 301, max = 400)))
summary(df)
       a                b               c
 Min.   : 2.326   Min.   :202.3   Min.   :303.8
 1st Qu.:32.985   1st Qu.:229.2   1st Qu.:319.8
 Median :49.293   Median :252.3   Median :338.4
 Mean   :52.267   Mean   :252.2   Mean   :344.1
 3rd Qu.:76.952   3rd Qu.:273.3   3rd Qu.:364.0
 Max.   :99.199   Max.   :299.3   Max.   :398.2
df <- fn.df.add.NA(df = df, var.name = "a", prop.of.missing = .1)
df <- fn.df.add.NA(df = df, var.name = "b", prop.of.missing = .2)
df <- fn.df.add.NA(df = df, var.name = "c", prop.of.missing = .3)
summary(df)
       a                b               c
 Min.   : 2.326   Min.   :202.3   Min.   :303.8
 1st Qu.:30.628   1st Qu.:229.2   1st Qu.:319.2
 Median :48.202   Median :252.3   Median :342.2
 Mean   :50.247   Mean   :252.5   Mean   :345.4
 3rd Qu.:71.504   3rd Qu.:273.3   3rd Qu.:369.3
 Max.   :99.199   Max.   :299.3   Max.   :396.2
 NA's   :10       NA's   :20      NA's   :30
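summary() reports the NA counts; to compare against the requested proportions directly, the share of missing values per column can be computed with colMeans(is.na(...)). A minimal sketch on a toy data frame (the values are illustrative, not taken from the test above):

```r
# Toy data frame with known missing values (illustrative only)
toy <- data.frame(x = c(1, NA, 3, NA), y = c(NA, 2, 3, 4))
na.share <- colMeans(is.na(toy)) # proportion of NAs in each column
na.share                         # x = 0.50, y = 0.25
```

Applied to the df from the test, the same call should give values close to the prop.of.missing passed for each variable.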
Thanks to Chris for editing the code.
See also this graphical representation: http://stackoverflow.com/a/28368161/3871924 – agenis