2015-05-12
2

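I have data containing a unique identifier, a category, and a description. Below is a toy dataset. How can I fill non-adjacent rows in R with data sampled from earlier rows?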

prjnumber <- c(1,2,3,4,5,6,7,8,9,10) 
category <- c("based","trill","lit","cold",NA,"epic", NA,NA,NA,NA) 
description <- c("skip class", 
       "dunk on brayden", 
       "record deal", 
       "fame and fortune", 
       NA, 
       "female attention", 
       NA,NA,NA,NA) 
toy.df <- data.frame(prjnumber, category, description) 

> toy.df
   prjnumber category      description
1          1    based       skip class
2          2    trill  dunk on brayden
3          3      lit      record deal
4          4     cold fame and fortune
5          5     <NA>             <NA>
6          6     epic female attention
7          7     <NA>             <NA>
8          8     <NA>             <NA>
9          9     <NA>             <NA>
10        10     <NA>             <NA>

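I want to randomly sample the 'category' and 'description' columns from the populated rows to fill the rows with missing data. The final data frame would be complete and would depend only on the initial 5 rows that contain data. The solution should preserve the correlation between the columns. The expected output is: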

> toy.df
   prjnumber category      description
1          1    based       skip class
2          2    trill  dunk on brayden
3          3      lit      record deal
4          4     cold fame and fortune
5          5      lit      record deal
6          6     epic female attention
7          7    based       skip class
8          8    based       skip class
9          9      lit      record deal
10        10    trill  dunk on brayden

Answers

5
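# Rows without any NA form the donor pool
complete <- na.omit(toy.df)
# TRUE for the rows that need to be filled
missing <- is.na(toy.df$category)
# Sample whole donor rows (with replacement) so each category keeps its description
toy.df[missing, c("category", "description")] <-
    complete[sample(nrow(complete), size = sum(missing), replace = TRUE),
             c("category", "description")]
toy.df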
#    prjnumber category      description
# 1          1    based       skip class
# 2          2    trill  dunk on brayden
# 3          3      lit      record deal
# 4          4     cold fame and fortune
# 5          5      lit      record deal
# 6          6     epic female attention
# 7          7     cold fame and fortune
# 8          8    based       skip class
# 9          9     epic female attention
# 10        10     epic female attention

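Though it seems like this would be a bit more straightforward if you didn't start with the unique identifiers already filled in for the NA rows...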
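Because `sample()` is random, every run gives a different fill, so the printed result will not match from run to run. The following is a minimal sketch of how one might make the fill reproducible and check that each category still sits next to its own description. The seed value and the names `filled`, `cols`, `donor_pairs` and `filled_pairs` are illustrative assumptions, not part of the answer.

```r
# Rebuild the toy data so the example is self-contained
toy.df <- data.frame(
  prjnumber   = 1:10,
  category    = c("based", "trill", "lit", "cold", NA, "epic", NA, NA, NA, NA),
  description = c("skip class", "dunk on brayden", "record deal",
                  "fame and fortune", NA, "female attention", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

set.seed(42)  # arbitrary seed, only so repeated runs give the same fill

cols     <- c("category", "description")
complete <- na.omit(toy.df)          # donor rows (no NA anywhere)
missing  <- is.na(toy.df$category)   # rows to fill

# Work on a copy so the original data stays available for comparison
filled <- toy.df
filled[missing, cols] <- complete[
  sample(nrow(complete), size = sum(missing), replace = TRUE), cols]

# Every pair in the result should be one of the original donor pairs
donor_pairs  <- paste(complete$category, complete$description, sep = " | ")
filled_pairs <- paste(filled$category, filled$description, sep = " | ")
all(filled_pairs %in% donor_pairs)
```

If the last expression returns `FALSE`, the two columns were sampled independently somewhere along the way.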

+0

Exactly, which is why it's so tricky. I have identifier slots with no data in non-contiguous rows.

5

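You could try:

library(dplyr) 
# Each column is sampled on its own here, and sample() returns as many values
# as there are non-missing entries, which happens to equal the NA count in the toy data
toy.df %>% 
     mutate_each(funs(replace(., is.na(.), sample(.[!is.na(.)]))), 2:3) 

Based on the new information, we may need a numeric index to use inside funs: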

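toy.df %>% 
    mutate(indx = replace(row_number(), is.na(category), 
      # draw one donor row index per missing row
      sample(row_number()[!is.na(category)], sum(is.na(category)), replace = TRUE))) %>% 
    mutate_each(funs(.[indx]), 2:3) %>%   # same index for both columns keeps pairs together
    select(-indx) 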
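In later dplyr releases `mutate_each()` and `funs()` are deprecated. Below is a hedged sketch of the same index-based idea written with `across()`, assuming dplyr 1.0.0 or newer is installed. The name `filled`, the use of `which()`, and the explicit sample size are my additions rather than part of the answer.

```r
library(dplyr)

# Rebuild the toy data so the example is self-contained
toy.df <- data.frame(
  prjnumber   = 1:10,
  category    = c("based", "trill", "lit", "cold", NA, "epic", NA, NA, NA, NA),
  description = c("skip class", "dunk on brayden", "record deal",
                  "fame and fortune", NA, "female attention", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

filled <- toy.df %>%
  mutate(
    # Complete rows keep their own row number; incomplete rows get the
    # row number of a randomly chosen complete row
    indx = replace(
      row_number(),
      is.na(category),
      sample(which(!is.na(category)), sum(is.na(category)), replace = TRUE)
    )
  ) %>%
  # Pull both columns through the same index so pairs stay together
  mutate(across(c(category, description), ~ .x[indx])) %>%
  select(-indx)

filled
```

Using one shared index for both columns is what keeps each category matched with its description; sampling inside `across()` directly would draw each column independently.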
+0

Added 'sample', which may be needed – akrun

+0

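...and maybe 'replace = TRUE'. It's also worth noting that this doesn't preserve the correlation between the two columns. It isn't entirely clear whether the OP needs that or not. – Gregor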

+0

@Gregor Thanks for the comment. I wasn't sure either from reading the question. – akrun

2

Using base R, to fill in one field at a time (which does not preserve the correlation between fields), use something like:

fields <- c('category','description') 
for(field in fields){ 
    missings <- is.na(toy.df[[field]]) 
    toy.df[[field]][missings] <- sample(toy.df[[field]][!missings], sum(missings), replace = TRUE) 
} 

And to fill them simultaneously (preserving the correlation between fields), use something like:

missings <- apply(toy.df[,fields], 
        1, 
        function(x)any(is.na(x))) 

toy.df[missings,fields] <- toy.df[!missings,fields][sample(sum(!missings), 
                  sum(missings), 
                  replace = TRUE),] 

And of course, to avoid the implicit loop in apply(x, 1, fun), you could use:

rowAny <- function(x) rowSums(x) > 0 
# rowSums() needs numeric or logical input, so pass the NA indicator matrix
missings <- rowAny(is.na(toy.df[,fields]))
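Another way to get the same row indicator, sketched here as an alternative rather than as part of the answer, is base R's `complete.cases()`, which flags rows that have no missing values in the selected columns. The toy data is rebuilt so the snippet runs on its own.

```r
# Rebuild the toy data so the example is self-contained
toy.df <- data.frame(
  prjnumber   = 1:10,
  category    = c("based", "trill", "lit", "cold", NA, "epic", NA, NA, NA, NA),
  description = c("skip class", "dunk on brayden", "record deal",
                  "fame and fortune", NA, "female attention", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)
fields <- c("category", "description")

# complete.cases() is TRUE for rows with no NA in the given columns,
# so negating it marks the rows that need filling
missings <- !complete.cases(toy.df[, fields])

# The rowSums() idea applied to the logical NA matrix, for comparison
missings_rowsums <- unname(rowSums(is.na(toy.df[, fields])) > 0)

which(missings)
# 5 7 8 9 10
```

Both vectors mark the same rows, so either can be used in the simultaneous fill.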