2017-08-23
0

R: Replacing NA values between data frames based on an index

I have this data, df.1:

    month  a           b           c
        1  0 0.000000000 0.000000000
        2  0 0.000000000 0.001503194
        3  0 0.000000000 0.000000000
        4  0 0.000000000 0.000000000
        5  0 0.000000000 0.000000000
        6  0 0.000000000 0.000000000
        7  0 0.000000000 0.000000000
        8  0 0.000000000 0.000000000
        9  0 0.000000000 0.000000000
       10  0 0.000000000 0.000000000
       11 NA          NA          NA
       12 NA          NA          NA
        1  0 0.000000000 0.000000000
        2  0 0.001537279 0.006917756
        3  0 0.000000000 0.003669725
        4  0 0.000000000 0.000000000
        5  0 0.000000000 0.000000000
        6  0 0.000000000 0.000000000
        7  0 0.000000000 0.000000000
        8  0 0.000000000 0.000000000
        9  0 0.000000000 0.000000000
       10  0 0.000000000 0.000000000
       11  0 0.000000000 0.013513514
       12 NA          NA          NA

and this data, df.2:

    month          a           b           c
        1 0.03842077 0.002266291 0.000000000
        2 0.01359501 0.001027937 0.000000000
        3 0.08631519 0.008732519 0.001376147
        4 0.26564710 0.083635347 0.019053692
        5 0.34839088 0.152203121 0.021010075
        6 0.31767367 0.152029019 0.029397773
        7 0.31507761 0.110973916 0.023445471
        8 0.29773872 0.096458381 0.026745770
        9 0.31226976 0.109342562 0.023996392
       10 0.23841220 0.081582743 0.021674228
       11 0.04379016 0.003519300 0.000000000
       12 0.02244389 0.002493766 0.000000000

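I want to replace the NA values (and only the NAs) in df.1[, 2:4] with the values from df.2[, 2:4] wherever the index in column 1 (month) is the same. I tried this code: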

res_new <- data.frame(matrix(nrow=nrow(df.1),ncol=3)) 
for (n in 1:12){ 
res_new <- data.frame(ifelse(is.na(df.1[which(df.1[,1] == n),2:4])==TRUE,df.2[which(df.2[,1] == n),2:4],df.1[,n])) 

    } 

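But the result is a big new matrix in which every NA value in df.1 is replaced by all of the values in df.2.

How can I do this? (My actual data frame is much bigger.)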

Answers

1

Assuming that entire rows are missing and need to be filled in, you can do this in two steps using which and match.

# find the location of the missing rows in df 
missRows <- which(!complete.cases(df.1)) 
# fill in missing rows with rows in df.2 with matching months 
df.1[missRows, ] <- df.2[match(df.1$month[missRows], df.2$month, nomatch=0),] 

Note that the missing rows are identified with !complete.cases. In addition, the nomatch = 0 argument is used to ignore instances that have no match.
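As a minimal sketch of the two steps above, here is the same idea on a small toy data set. The values are made up for illustration and are not the question's data; the extra stopifnot check is an added assumption, not part of the original answer.

```r
# Toy data (assumed): months repeat in df.1, and some rows are entirely NA
df.1 <- data.frame(month = c(1:3, 1:3),
                   a = c(0, 0, NA, 0, NA, NA),
                   b = c(0.1, 0, NA, 0, NA, NA))
df.2 <- data.frame(month = 1:3,
                   a = c(10, 20, 30),
                   b = c(1, 2, 3))

# Every month in df.1 should exist in df.2; otherwise nomatch = 0 would
# silently drop rows from the replacement and the row counts would differ
stopifnot(all(df.1$month %in% df.2$month))

# Step 1: positions of the rows that contain missing values
missRows <- which(!complete.cases(df.1))

# Step 2: overwrite those rows with the df.2 rows for the same month
df.1[missRows, ] <- df.2[match(df.1$month[missRows], df.2$month, nomatch = 0), ]
df.1
```

Because whole rows are overwritten, this only suits data where a row is either complete or entirely missing apart from month. A row with a single NA would lose its other values as well.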

0

Maybe not the best approach, but something like this could work!

df1 <- data.frame(month = 1:12, 
        a = c(rep(1, 10), NA, NA), 
        b = c(rep(2, 11), NA)) 

df2 <- data.frame(month = 1:12, 
        a = rnorm(12), 
        b = rnorm(12)) 

# first, merge both data frame by the key in this case the month 
new_df <- merge(df1, df2, by = "month") 

# then use a vectorize operation with ifelse function 
new_df$imp_a <- ifelse(!is.na(new_df$a.x), new_df$a.x, new_df$a.y) 

# then drop the temporary .x/.y columns, or keep only the
# newly imputed columns
new_df 

You could wrap the ifelse step in a function if you need to impute multiple columns, like this:

impute <- function(df, col1, col2) { 
    # fill the NAs in col1 with the values of col2, storing the result in a new column 
    # (paste() takes `sep`, not `by`, as its separator argument) 
    new_name <- paste("new", col1, sep = "_") 
    df[[new_name]] <- ifelse(!is.na(df[[col1]]), df[[col1]], df[[col2]]) 
    df 
} 

impute(new_df, "a.x", "a.y") 
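To make the multi-column case concrete, here is a rough sketch that calls impute once per column after the merge and then keeps only the imputed columns. The toy values and the final renaming step are assumptions for illustration. impute is redefined here with paste(..., sep = "_") so that the block runs on its own.

```r
# Helper: fill NAs in col1 from col2, storing the result in "new_<col1>"
impute <- function(df, col1, col2) {
  new_name <- paste("new", col1, sep = "_")
  df[[new_name]] <- ifelse(!is.na(df[[col1]]), df[[col1]], df[[col2]])
  df
}

# Toy data (assumed)
df1 <- data.frame(month = 1:4, a = c(1, NA, 1, NA), b = c(2, 2, NA, NA))
df2 <- data.frame(month = 1:4, a = c(10, 20, 30, 40), b = c(50, 60, 70, 80))

# merge() suffixes the clashing column names with .x (df1) and .y (df2)
new_df <- merge(df1, df2, by = "month")

# Impute each shared column in turn
for (col in c("a", "b")) {
  new_df <- impute(new_df, paste0(col, ".x"), paste0(col, ".y"))
}

# Keep only the key and the imputed columns, restoring the original names
result <- new_df[, c("month", "new_a.x", "new_b.x")]
names(result) <- c("month", "a", "b")
result
```

Note that merge sorts its output by the key by default, so if df1 repeats months the row order of the result will differ from the input.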
0

Given that you have a larger data frame, I would try to avoid merging tables. You can get the job done with ifelse.

month <- c(1:12, 1:12) 
a <- c(rep(0,10), NA, NA, rep(0,11), NA) 
b <- c(rep(0,10), NA, NA, 0,.0015,rep(0,9), NA) 
c <- c(0,.0015,rep(0,8), NA, NA, 0,.0069, .0036,rep(0,7), .0135, NA) 
df.1 <- data.frame(month,a,b,c) 

df.2 <- data.frame(month=c(1:12), a=rep(1,12), b=rep(2,12), c=rep(3,12)) 

df.1$a <- ifelse(is.na(df.1$a), df.2$a[match(df.1$month, df.2$month)], df.1$a) 
df.1$b <- ifelse(is.na(df.1$b), df.2$b[match(df.1$month, df.2$month)], df.1$b) 
df.1$c <- ifelse(is.na(df.1$c), df.2$c[match(df.1$month, df.2$month)], df.1$c) 

> df.1 
    month a  b  c 
1  1 0 0.0000 0.0000 
2  2 0 0.0000 0.0015 
3  3 0 0.0000 0.0000 
4  4 0 0.0000 0.0000 
5  5 0 0.0000 0.0000 
6  6 0 0.0000 0.0000 
7  7 0 0.0000 0.0000 
8  8 0 0.0000 0.0000 
9  9 0 0.0000 0.0000 
10 10 0 0.0000 0.0000 
11 11 1 2.0000 3.0000 
12 12 1 2.0000 3.0000 
13  1 0 0.0000 0.0000 
14  2 0 0.0015 0.0069 
15  3 0 0.0000 0.0036 
16  4 0 0.0000 0.0000 
17  5 0 0.0000 0.0000 
18  6 0 0.0000 0.0000 
19  7 0 0.0000 0.0000 
20  8 0 0.0000 0.0000 
21  9 0 0.0000 0.0000 
22 10 0 0.0000 0.0000 
23 11 0 0.0000 0.0135 
24 12 1 2.0000 3.0000 
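It may help to see why match makes the lengths line up even though df.1 repeats each month. The sketch below uses a small made-up data set, computes the lookup index once, and checks that only the NA cells changed. The variable names idx, before and was_na are illustrative.

```r
# Toy data (assumed): 6 rows in df.1, 3 rows in df.2
df.1 <- data.frame(month = c(1:3, 1:3), a = c(0, NA, 0, NA, 0, NA))
df.2 <- data.frame(month = 1:3, a = c(10, 20, 30))

# For each row of df.1, the row of df.2 holding the same month;
# the index has one entry per row of df.1, so df.2$a[idx] has length 6
idx <- match(df.1$month, df.2$month)

before <- df.1$a
df.1$a <- ifelse(is.na(before), df.2$a[idx], before)

# Sanity checks: non-NA cells are untouched, NA cells took the df.2 value
was_na <- is.na(before)
stopifnot(identical(df.1$a[!was_na], before[!was_na]))
stopifnot(identical(df.1$a[was_na], df.2$a[idx][was_na]))
df.1
```

Computing idx once and reusing it for every column avoids repeating the same match call per column.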
+0

So, would it be possible to do this with a loop? (I think...) – skylobo

+0

@skylobo Yes, that is definitely possible. You could use this (assuming the columns of df.1 and df.2 match): for (i in 1:NCOL(df.1)) { df.1[, i] <- ifelse(is.na(df.1[, i]), df.2[, i][match(df.1$month, df.2$month)], df.1[, i]) } – user108363

1

The first 12 rows of the data:

df.1 <- data.frame(
    month = 1:12, 
    a = c(rep(0, 10), NA, NA), 
    b = c(rep(0, 10), NA, NA), 
    c = c(0, 0.001503194, rep(0, 8), NA, NA) 
) 

df.2 <- data.frame(
    month = 1:12, 
    a = c(0.03842077, 0.01359501, 0.08631519, 0.2656471, 0.34839088, 0.31767367, 
     0.31507761, 0.29773872, 0.31226976, 0.2384122, 0.04379016, 0.02244389), 
    b = c(0.002266291, 0.001027937, 0.008732519, 0.083635347, 0.152203121, 
     0.152029019, 0.110973916, 0.096458381, 0.109342562, 0.081582743, 
     0.0035193, 0.002493766), 
    c = c(0, 0, 0.001376147, 0.019053692, 0.021010075, 0.029397773, 0.023445471, 
     0.02674577, 0.023996392, 0.021674228, 0, 0) 
) 

This solution also handles rows in which only some of the columns are NA. It may take some time on large data, but it gets the job done.

for (row in 1:nrow(df.1)) { 
    for (col in names(df.1)[-1]) { 
        if (is.na(df.1[row, col]) && df.1[row, "month"] == df.2[row, "month"]) { 
            df.1[row, col] <- df.2[row, col] 
        } 
    } 
} 
df.1 

    month   a   b   c 
1  1 0.00000000 0.000000000 0.000000000 
2  2 0.00000000 0.000000000 0.001503194 
3  3 0.00000000 0.000000000 0.000000000 
4  4 0.00000000 0.000000000 0.000000000 
5  5 0.00000000 0.000000000 0.000000000 
6  6 0.00000000 0.000000000 0.000000000 
7  7 0.00000000 0.000000000 0.000000000 
8  8 0.00000000 0.000000000 0.000000000 
9  9 0.00000000 0.000000000 0.000000000 
10 10 0.00000000 0.000000000 0.000000000 
11 11 0.04379016 0.003519300 0.000000000 
12 12 0.02244389 0.002493766 0.000000000 

Explanation

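Using a double loop, we check every element of columns a to c. If the element is not NA, we move on to the next one. Otherwise, we check whether the month in the same row of df.2 is the same, and if that is TRUE, we replace the element with the corresponding element of df.2.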
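The loop above compares row i of df.1 with row i of df.2, so it relies on both data frames having the same number of rows in the same order. When df.1 is longer, df.2[row, "month"] is NA for the extra rows, the && condition evaluates to NA, and if() stops with "missing value where TRUE/FALSE needed". A possible variant, sketched here on made-up data, looks up the df.2 row by month with match instead. The variable name src is illustrative.

```r
# Toy data (assumed): df.1 repeats the months, df.2 lists each month once
df.1 <- data.frame(month = c(1:3, 1:3),
                   a = c(0, NA, 0, NA, 0, 0),
                   b = c(NA, 0, 0, 0, 0, NA))
df.2 <- data.frame(month = 1:3, a = c(10, 20, 30), b = c(1, 2, 3))

# Row-aligned lookup breaks past the end of df.2: the month is NA there
is.na(df.2[4, "month"])

for (row in seq_len(nrow(df.1))) {
  # row of df.2 with the same month (NA if the month is absent from df.2)
  src <- match(df.1[row, "month"], df.2$month)
  if (is.na(src)) next
  for (col in names(df.1)[-1]) {
    if (is.na(df.1[row, col])) {
      df.1[row, col] <- df.2[src, col]
    }
  }
}
df.1
```

This keeps the element-by-element behaviour of the original loop, so partially missing rows are still handled, but it no longer depends on the two data frames being the same length.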

+0

R replies: Error in if (is.na(prop_taglie[row, col]) && prop_taglie[row, "mesi"] == : missing value where TRUE/FALSE needed – skylobo

+0

This can happen when the two data.frames have a different number of rows. – snoram