2016-07-31 56 views
2

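Replace NA with the mean computed from the subset of rows matching another column?

I have a data frame in which each row contains a person's gender and weight (in pounds):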

genders <- c("FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "MALE", "MALE", "MALE", "MALE") 
weights <- c(110.0, 120.0, 112.0, NA, NA, 190.0, 202.0, 195.0, NA) 

df <- data.frame(gender=genders, weight=weights) 
df 
# gender weight 
# 1 FEMALE 110 
# 2 FEMALE 120 
# 3 FEMALE 112 
# 4 FEMALE  NA 
# 5 FEMALE  NA 
# 6 MALE 190 
# 7 MALE 202 
# 8 MALE 195 
# 9 MALE  NA 

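For each row that has NA in the weight column, I would like to replace (impute) the NA with the mean of weight, but the mean should be calculated using only the rows that have the same gender value as the row with the NA.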

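Specifically, rows 4 and 5 have a gender of FEMALE and a weight of NA. I want to replace each NA with the mean weight computed over the subset of rows whose gender is FEMALE. In this case, the mean comes from rows 1, 2, and 3: (110 + 120 + 112)/3 = 114.0.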

Similarly, I would like to replace the NA weight in the MALE row with the mean weight of the rows whose gender is MALE.

I tried the following command, but it replaces the NAs with the mean weight over all users of both genders, which is not what I want.

df$weight[is.na(df$weight)] <- mean(subset(df, gender=df$gender)$weight, na.rm=T) 
df 
# gender weight 
# 1 FEMALE 110.0000 
# 2 FEMALE 120.0000 
# 3 FEMALE 112.0000 
# 4 FEMALE 154.8333 
# 5 FEMALE 154.8333 
# 6 MALE 190.0000 
# 7 MALE 202.0000 
# 8 MALE 195.0000 
# 9 MALE 154.8333 
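A likely reason the attempt above averages over everyone: inside subset(), `gender=df$gender` is a named argument rather than a comparison, so no row filter is applied. The sketch below rebuilds the question's data and illustrates this; the variable names all_rows, overall_mean, and female_mean are introduced here only for illustration.

```r
# Rebuild the data from the question
genders <- c("FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE",
             "MALE", "MALE", "MALE", "MALE")
weights <- c(110.0, 120.0, 112.0, NA, NA, 190.0, 202.0, 195.0, NA)
df <- data.frame(gender = genders, weight = weights)

# `gender = df$gender` is passed as a named argument, not as the
# row condition, so subset() keeps every row
all_rows <- subset(df, gender = df$gender)
nrow(all_rows)

# The mean is therefore taken over both genders
overall_mean <- mean(all_rows$weight, na.rm = TRUE)

# A row condition needs a logical expression, for example with `==`
female_mean <- mean(subset(df, gender == "FEMALE")$weight, na.rm = TRUE)
```

Even with `==`, comparing against the whole df$gender column would not pick one group per NA row, which is why the answers compute the mean group by group.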

I searched for other questions, but none of them is quite the same problem as mine:

Replace NA with mean matching the same ID

How to replace NA with mean by subset in R (impute with plyr?)

How to replace NA values in a table for selected columns? data.frame, data.table

Answers

6

You could use ave() with replace() (or standard manual replacement).

df$weight <- with(df, ave(weight, gender, 
    FUN = function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))) 

This gives

gender weight 
1 FEMALE 110.0000 
2 FEMALE 120.0000 
3 FEMALE 112.0000 
4 FEMALE 114.0000 
5 FEMALE 114.0000 
6 MALE 190.0000 
7 MALE 202.0000 
8 MALE 195.0000 
9 MALE 195.6667 
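Because ave() splits on whatever values the grouping vector contains, the same pattern should carry over to any number of groups. A minimal sketch on made-up data follows; the data frame df3 and the helper name impute_mean are hypothetical and not part of the question.

```r
# Hypothetical data with three groups (not from the question)
df3 <- data.frame(
  group = c("A", "A", "A", "B", "B", "C", "C", "C"),
  value = c(1, NA, 3, 10, NA, NA, 5, 7)
)

# Hypothetical helper: fill the NAs in x with the mean of its non-NA values
impute_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# ave() applies the helper within each group and returns the values
# in the original row order
df3$value <- ave(df3$value, df3$group, FUN = impute_mean)
df3$value
```

One caveat: if a group contains only NA values, mean(x, na.rm = TRUE) returns NaN, so those rows would be filled with NaN rather than a number.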

Thanks. A simple answer with no extra packages is exactly what I was looking for. The ave() function looks very powerful. – stackoverflowuser2010

3

You could group your data frame by gender, then compute the mean weight and replace the NAs with an ifelse statement. With dplyr, that might be:

library(dplyr) 
df %>% 
     group_by(gender) %>% 
     mutate(weight = ifelse(is.na(weight), mean(weight, na.rm = T), weight)) 

# Source: local data frame [9 x 2] 
# Groups: gender [2] 

# gender weight 
# <fctr> <dbl> 
# 1 FEMALE 110.0000 
# 2 FEMALE 120.0000 
# 3 FEMALE 112.0000 
# 4 FEMALE 114.0000 
# 5 FEMALE 114.0000 
# 6 MALE 190.0000 
# 7 MALE 202.0000 
# 8 MALE 195.0000 
# 9 MALE 195.6667 

The new 'coalesce' function would fit well here too. – alistaire


@alistaire That looks very useful and convenient. – Psidom
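The coalesce() suggestion from the comments might look like the sketch below. It assumes a dplyr version that provides coalesce(); the result name filled is introduced here for illustration.

```r
library(dplyr)

# Rebuild the data from the question
df <- data.frame(
  gender = c("FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE",
             "MALE", "MALE", "MALE", "MALE"),
  weight = c(110.0, 120.0, 112.0, NA, NA, 190.0, 202.0, 195.0, NA)
)

# Within each gender, coalesce() keeps weight where it is present and
# falls back to the group mean where it is NA
filled <- df %>%
  group_by(gender) %>%
  mutate(weight = coalesce(weight, mean(weight, na.rm = TRUE))) %>%
  ungroup()

filled$weight
```

Compared with ifelse(), coalesce() states the intent (take the first non-missing value) directly and avoids repeating the column name.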

2

Using base R, this seems to be what you're looking for:

df$weight[df$gender=="FEMALE" & is.na(df$weight)] <- mean(df$weight[df$gender=="FEMALE"], na.rm=TRUE) 
df$weight[df$gender=="MALE" & is.na(df$weight)] <- mean(df$weight[df$gender=="MALE"], na.rm=TRUE) 

> df 
    gender weight 
1 FEMALE 110.0000 
2 FEMALE 120.0000 
3 FEMALE 112.0000 
4 FEMALE 114.0000 
5 FEMALE 114.0000 
6 MALE 190.0000 
7 MALE 202.0000 
8 MALE 195.0000 
9 MALE 195.6667 

This is a very manual approach. How would it work with more than two groups? See the more generalized approach in the comments –


Is there a way to do this without hard-coding "FEMALE" and "MALE"? The data in the column could have dozens of unique values. – stackoverflowuser2010


@DavidArenburg Good point. Richard Scriven's approach looks better and works when there are several unique values. – Warner
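For readers who want to keep the explicit subsetting style of this answer without hard-coding the group names, one possible sketch is a loop over the distinct values of the column. The loop variable g and the helpers in_group and group_mean are names chosen here for illustration.

```r
# Rebuild the data from the question
genders <- c("FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE",
             "MALE", "MALE", "MALE", "MALE")
weights <- c(110.0, 120.0, 112.0, NA, NA, 190.0, 202.0, 195.0, NA)
df <- data.frame(gender = genders, weight = weights)

# Loop over whatever distinct values the column holds
for (g in unique(as.character(df$gender))) {
  in_group <- df$gender == g
  # Mean of the observed weights in this group
  group_mean <- mean(df$weight[in_group], na.rm = TRUE)
  # Fill only the missing weights that belong to this group
  df$weight[in_group & is.na(df$weight)] <- group_mean
}

df$weight
```

This does the same work as the ave() answer, just spelled out step by step; ave() is shorter once the pattern is familiar.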

1

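This can be done easily using na.aggregate from zoo. Convert the 'data.frame' to a 'data.table' (setDT(df)); grouped by 'gender', we apply na.aggregate on 'weight' to replace the NA elements with the mean value. By default, na.aggregate uses the mean, but we can also change the FUN argument to get the median, sum, etc.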

library(data.table) 
library(zoo) 
setDT(df)[, weight := na.aggregate(weight) , by = gender] 

Or with ave from base R (na.aggregate still comes from zoo):

with(df, ave(weight, gender, FUN = na.aggregate)) 
#[1] 110.0000 120.0000 112.0000 114.0000 114.0000 190.0000 202.0000 195.0000 195.6667
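As a sketch of the FUN argument mentioned above, the per-group median can be used in place of the mean. This assumes the zoo package is installed; the result name median_filled is introduced here for illustration.

```r
library(zoo)

# Rebuild the data from the question
df <- data.frame(
  gender = c("FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE",
             "MALE", "MALE", "MALE", "MALE"),
  weight = c(110.0, 120.0, 112.0, NA, NA, 190.0, 202.0, 195.0, NA)
)

# Within each gender, fill NAs with the group median instead of the mean
median_filled <- with(df, ave(weight, gender,
  FUN = function(x) na.aggregate(x, FUN = median)))

median_filled
```

The median is less sensitive to extreme weights than the mean, which may matter when a group has outliers.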