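How can I fill non-adjacent rows in R with data sampled from previous rows?

I have data containing a unique identifier, a category, and a description. Below is a toy dataset.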

prjnumber <- c(1,2,3,4,5,6,7,8,9,10) 
category <- c("based","trill","lit","cold",NA,"epic", NA,NA,NA,NA) 
description <- c("skip class", 
       "dunk on brayden", 
       "record deal", 
       "fame and fortune", 
       NA, 
       "female attention", 
       NA,NA,NA,NA) 
toy.df <- data.frame(prjnumber, category, description) 

> toy.df
   prjnumber category      description
1          1    based       skip class
2          2    trill  dunk on brayden
3          3      lit      record deal
4          4     cold fame and fortune
5          5     <NA>             <NA>
6          6     epic female attention
7          7     <NA>             <NA>
8          8     <NA>             <NA>
9          9     <NA>             <NA>
10        10     <NA>             <NA>

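I want to randomly sample the 'category' and 'description' columns from the populated rows and use those values to fill the rows with missing data. The final data frame should be complete and rely only on the 5 rows that originally contain data. The solution should preserve the correlation between the columns. An expected output would be: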

> toy.df
   prjnumber category      description
1          1    based       skip class
2          2    trill  dunk on brayden
3          3      lit      record deal
4          4     cold fame and fortune
5          5      lit      record deal
6          6     epic female attention
7          7    based       skip class
8          8    based       skip class
9          9      lit      record deal
10        10    trill  dunk on brayden

Source

2015-05-12 Jason Matney

# Rows without any missing values serve as the donor pool
complete = na.omit(toy.df)
# Fill category and description together from randomly drawn donor rows,
# so every filled row keeps a matching category/description pair
toy.df[is.na(toy.df$category), c("category", "description")] =
    complete[sample(1:nrow(complete), size = sum(is.na(toy.df$category)), replace = TRUE),
      c("category", "description")]
toy.df
#    prjnumber category      description
# 1          1    based       skip class
# 2          2    trill  dunk on brayden
# 3          3      lit      record deal
# 4          4     cold fame and fortune
# 5          5      lit      record deal
# 6          6     epic female attention
# 7          7     cold fame and fortune
# 8          8    based       skip class
# 9          9     epic female attention
# 10        10     epic female attention

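It would seem a little more straightforward, though, if you didn't start out with the unique identifiers already filled in for the NA rows...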
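For reuse, the same sampling step could be wrapped in a small helper. This is only a sketch: the name fill_from_complete and its arguments are made up for illustration, the toy data are re-created so the block runs on its own, and set.seed() is there only to make the random draw repeatable.

```r
# Hypothetical helper (not part of the answer above): fill rows whose `cols`
# contain NA with `cols` values drawn jointly from the complete rows.
fill_from_complete <- function(df, cols) {
  incomplete <- !complete.cases(df[, cols, drop = FALSE])
  donors <- which(!incomplete)
  if (!any(incomplete)) return(df)
  if (length(donors) == 0) stop("no complete rows to sample from")
  # Sample positions within `donors`, so a single donor row is handled correctly
  picked <- donors[sample.int(length(donors), sum(incomplete), replace = TRUE)]
  df[incomplete, cols] <- df[picked, cols, drop = FALSE]
  df
}

toy.df <- data.frame(
  prjnumber = 1:10,
  category = c("based", "trill", "lit", "cold", NA, "epic", NA, NA, NA, NA),
  description = c("skip class", "dunk on brayden", "record deal",
                  "fame and fortune", NA, "female attention", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

set.seed(42)  # only so that the random draw can be reproduced
filled <- fill_from_complete(toy.df, c("category", "description"))
filled
```

Because whole donor rows are picked, each filled row receives a category and description that already occurred together.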

Source

2015-05-12 17:48:19 Gregor

Exactly, and that is why it's so tricky. I have identifier slots with no data in non-contiguous rows.

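You could try:

library(dplyr) 
# Each column is sampled independently here, so category/description pairs
# are not kept together. mutate_each() and funs() are deprecated in later
# dplyr releases.
toy.df %>% 
     mutate_each(funs(replace(., is.na(.), sample(.[!is.na(.)]))), 2:3)

Based on the new information, we may need a numeric index to use within funs.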

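# indx maps each row to itself, or to a randomly drawn complete row if it has NAs;
# size is set explicitly so the draw matches the number of rows to fill
toy.df %>% 
    mutate(indx = replace(row_number(), is.na(category), 
      sample(row_number()[!is.na(category)], size = sum(is.na(category)), replace = TRUE))) %>% 
    mutate_each(funs(.[indx]), 2:3) %>% 
    select(-indx)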
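Since mutate_each() and funs() have been deprecated in dplyr, a rough equivalent for newer versions may be useful. The sketch below assumes dplyr 1.0.0 or later, where across() is available. The names donor_rows and donor_idx are illustrative, and the data are re-created so the block runs on its own.

```r
library(dplyr)

toy.df <- data.frame(
  prjnumber = 1:10,
  category = c("based", "trill", "lit", "cold", NA, "epic", NA, NA, NA, NA),
  description = c("skip class", "dunk on brayden", "record deal",
                  "fame and fortune", NA, "female attention", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

is_missing <- is.na(toy.df$category)
donor_rows <- which(!is_missing)

# Row index vector: complete rows point at themselves, incomplete rows
# point at a randomly chosen complete row (sampling with replacement).
donor_idx <- seq_len(nrow(toy.df))
donor_idx[is_missing] <- donor_rows[
  sample.int(length(donor_rows), sum(is_missing), replace = TRUE)
]

# Reindex both columns with the same index so the pairs stay together.
filled <- toy.df %>%
  mutate(across(c(category, description), ~ .x[donor_idx]))
filled
```

Using one shared index for both columns is what keeps each category with its description.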

Source

2015-05-12 17:55:06 akrun

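Added 'sample', which may be needed. – akrun

...and maybe 'replace = TRUE'. It's also worth noting that this does not preserve the correlation between the two columns. It isn't entirely clear whether the OP needs that or not. – Gregor

@Gregor Thanks for the comments. I'm not sure either from reading the question. – akrun

Using base R, to fill in a single field at a time (without preserving the correlation between fields), use something like: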


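fields <- c('category','description') 
for(field in fields){ 
    # rows where this field is missing
    missings <- is.na(toy.df[[field]]) 
    # draw replacements (with replacement) from the observed values of this field only
    toy.df[[field]][missings] <- sample(toy.df[[field]][!missings], sum(missings), replace = TRUE) 
}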

To fill them simultaneously (preserving the correlation between fields), use something like:

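# rows where any of the fields is missing
missings <- apply(toy.df[,fields], 
        1, 
        function(x)any(is.na(x))) 

# copy whole donor rows, drawn with replacement from the complete rows
toy.df[missings,fields] <- toy.df[!missings,fields][sample(sum(!missings), 
                  sum(missings), 
                  replace = TRUE),]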
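The difference between the two approaches above can be checked directly by comparing the filled pairs with the pairs found in the original complete rows. The following is a sketch only: pairs_preserved is a name invented here, the data are re-created so the block is self-contained, and set.seed() is used just for repeatability.

```r
toy.df <- data.frame(
  prjnumber = 1:10,
  category = c("based", "trill", "lit", "cold", NA, "epic", NA, NA, NA, NA),
  description = c("skip class", "dunk on brayden", "record deal",
                  "fame and fortune", NA, "female attention", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)
fields <- c("category", "description")
missings <- is.na(toy.df$category) | is.na(toy.df$description)
n_fill <- sum(missings)
n_pool <- sum(!missings)

# Hypothetical check: does every filled pair occur among the complete rows?
pairs_preserved <- function(filled, original) {
  key <- function(d) paste(d$category, d$description, sep = " | ")
  donors <- original[!is.na(original$category) & !is.na(original$description), ]
  all(key(filled) %in% key(donors))
}

set.seed(1)  # only for repeatability

# Field-by-field fill: an independent draw per column
by_field <- toy.df
for (field in fields) {
  pool <- toy.df[[field]][!missings]
  by_field[[field]][missings] <- pool[sample.int(n_pool, n_fill, replace = TRUE)]
}

# Joint fill: one draw of donor rows shared by all fields
by_row <- toy.df
donor_rows <- which(!missings)[sample.int(n_pool, n_fill, replace = TRUE)]
by_row[missings, fields] <- toy.df[donor_rows, fields]

pairs_preserved(by_row, toy.df)    # TRUE by construction
pairs_preserved(by_field, toy.df)  # may be FALSE, since pairs can get mixed
```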

And of course, to avoid the implicit loop in apply(x,1,fun), you could use:

rowAny <- function(x) rowSums(x) > 0 
# rowSums() needs numeric or logical input, so test for NA first
missings <- rowAny(is.na(toy.df[,fields]))
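Base R also ships complete.cases(), which gives the same row mask without a helper function. A short sketch, with the toy data re-created so it runs on its own:

```r
toy.df <- data.frame(
  prjnumber = 1:10,
  category = c("based", "trill", "lit", "cold", NA, "epic", NA, NA, NA, NA),
  description = c("skip class", "dunk on brayden", "record deal",
                  "fame and fortune", NA, "female attention", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)
fields <- c("category", "description")

# complete.cases() is TRUE for rows with no NA in the given columns,
# so its negation flags the rows that need filling.
missings <- !complete.cases(toy.df[, fields])

# Same rows as the rowSums()-based version
via_rowsums <- unname(rowSums(is.na(toy.df[, fields])) > 0)
identical(missings, via_rowsums)
which(missings)  # rows 5, 7, 8, 9, 10
```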

Source

2015-05-12 17:55:08 Jthorpe
