R data.table: remove rows where one column is duplicated if another column is NA

Here is an example data.table:

dt <- data.table(col1 = c('A', 'A', 'B', 'C', 'C', 'D'), col2 = c(NA, 'dog', 'cat', 'jeep', 'porsch', NA)) 

   col1   col2
1:    A     NA
2:    A    dog
3:    B    cat
4:    C   jeep
5:    C porsch
6:    D     NA

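I would like to remove rows where col1 is duplicated if col2 is NA and another row in the same group has a non-NA value. In other words: group by col1, then if a group has more than one row and one of them is NA, remove it. This would be the result for dt: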

   col1   col2
2:    A    dog
3:    B    cat
4:    C   jeep
5:    C porsch
6:    D     NA

I tried this:

dt[, list(col2 = ifelse(length(col1>1), col2[!is.na(col2)], col2)), by=col1] 

   col1 col2
1:    A  dog
2:    B  cat
3:    C jeep
4:    D   NA

What am I missing? Thank you.

Source

2017-08-07 alexvpickering

An attempt at finding all the NA cases in groups that also contain a non-NA value, and then removing those rows:

dt[-dt[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1] 
# col1 col2 
#1: A dog 
#2: B cat 
#3: C jeep 
#4: C porsch 
#5: D  NA
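
A variation on the idea above, offered only as a sketch: instead of collecting the rows to drop and negating them, collect the rows to keep. This sidesteps negative indexing, which under base R semantics selects nothing when the index vector is empty (for example, when no group mixes NA and non-NA values). The name keep_idx is made up for illustration and is not part of the original answer.

```r
library(data.table)

dt <- data.table(col1 = c('A', 'A', 'B', 'C', 'C', 'D'),
                 col2 = c(NA, 'dog', 'cat', 'jeep', 'porsch', NA))

# Base R: negating an empty index selects nothing rather than everything
n_left <- length(letters[-integer(0)])

# keep_idx (hypothetical name): row numbers to keep, i.e. rows whose col2
# is not NA, or rows in a group where every col2 value is NA
keep_idx <- dt[, .I[!is.na(col2) | all(is.na(col2))], by = col1]$V1

# sort() restores the original row order in case groups are interleaved
res <- dt[sort(keep_idx)]
```

On the example data this keeps the same five rows as the answer above.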

It seems quicker, though I'm sure someone will turn up shortly with an even quicker version:

set.seed(1) 
dt2 <- data.table(col1=sample(1:5e5,5e6,replace=TRUE), col2=sample(c(1:8,NA),5e6,replace=TRUE)) 
system.time(dt2[-dt2[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1]) 
# user system elapsed 
# 1.49 0.02 1.51 
system.time(dt2[, .(col2 = if(all(is.na(col2))) NA_integer_ else na.omit(col2)), by = col1]) 
# user system elapsed 
# 4.49 0.04 4.54

Source

2017-08-07 23:40:41 thelatemail

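You misplaced a parenthesis (probably a typo); I think it should be length(col1) > 1. You are also using ifelse on a scalar condition, which won't work the way you expect (only the first element of the vector is picked up). If you want to remove NA values from a group when it also contains non-NA values, you can use if/else: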

dt[, .(col2 = if(all(is.na(col2))) NA_character_ else na.omit(col2)), by = col1] 

# col1 col2 
#1: A dog 
#2: B cat 
#3: C jeep 
#4: C porsch 
#5: D  NA
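
The two points above can be checked in isolation with a plain vector. This is a minimal sketch in base R; the names x, scalar_test and whole_branch are illustrative only.

```r
x <- c("jeep", "porsch")

# length(x > 1) is the length of a logical vector (2 here), not a comparison,
# so used as a condition it is truthy for any non-empty group
len_of_comparison <- length(x > 1)

# ifelse() returns a result shaped like its test, so a length-1 test
# yields only the first element of the chosen branch
scalar_test <- ifelse(length(x) > 1, x[!is.na(x)], x)

# if/else evaluates and returns the chosen branch as a whole
whole_branch <- if (length(x) > 1) x[!is.na(x)] else x
```

Here scalar_test holds a single value while whole_branch keeps both elements, which is why the second row of group C disappeared in the question's attempt.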

Source

2017-08-07 23:30:29 Psidom

"group by col1, then if a group has more than one row and one of them is NA, remove it."

Use an anti-join:

dt[!dt[, if (.N > 1L) .SD[NA_integer_], by=col1], on=names(dt)] 

   col1   col2
1:    A    dog
2:    B    cat
3:    C   jeep
4:    C porsch
5:    D     NA
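
To make the anti-join easier to follow, here is the same idea spelled out in two steps. It is a sketch rather than a replacement for the one-liner: drop_rows is a hypothetical name, and the approach relies on data.table joins matching NA to NA, which the one-liner above also depends on.

```r
library(data.table)

dt <- data.table(col1 = c('A', 'A', 'B', 'C', 'C', 'D'),
                 col2 = c(NA, 'dog', 'cat', 'jeep', 'porsch', NA))

# Step 1: for every group with more than one row, build a (col1, NA) row.
# These are the rows we would like to remove if they exist in dt.
drop_rows <- dt[, .N, by = col1][N > 1L, .(col1, col2 = NA_character_)]

# Step 2: anti-join. dt[!drop_rows, on = ...] keeps the rows of dt that
# have no match in drop_rows on both columns.
res <- dt[!drop_rows, on = c("col1", "col2")]
```

Group C contributes a (C, NA) row to drop_rows, but since dt has no such row the anti-join simply ignores it.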

Benchmark from @thela, but assuming there are no (full) duplicates in the original data:

set.seed(1) 
dt2a <- data.table(col1=sample(1:5e5,5e6,replace=TRUE), col2=sample(c(1:8,NA),5e6,replace=TRUE)) 
dt2 = unique(dt2a) 

system.time(res_thela <- dt2[-dt2[, .I[any(!is.na(col2)) & is.na(col2)], by=col1]$V1]) 
# user system elapsed 
# 0.73 0.06 0.81 

system.time(res_psidom <- dt2[, .(col2 = if(all(is.na(col2))) NA_integer_ else na.omit(col2)), by = col1]) 
# user system elapsed 
# 2.86 0.03 2.89 

system.time(res <- dt2[!dt2[, .N, by=col1][N > 1L, !"N"][, col2 := dt2$col2[NA_integer_]], on=names(dt2)]) 
# user system elapsed 
# 0.39 0.01 0.41 

fsetequal(res, res_thela) # TRUE 
fsetequal(res, res_psidom) # TRUE

I changed the approach slightly for speed. If data.table had a having= argument, this might become faster and more legible.

Source

2017-08-08 00:45:54 Frank
