Counting elements

I have a data frame of non-numeric data, i.e.

Col1 <- c("a", "b","b",NA) 
Col2 <- c(NA, "a", "c", NA) 
Col3 <- c(NA,NA,"b", "a") 

dat <- data.frame(Col1, Col2, Col3) 
dat 
#   Col1 Col2 Col3
# 1    a <NA> <NA>
# 2    b    a <NA>
# 3    b    c    b
# 4 <NA> <NA>    a

I want to add columns that count the occurrences of each character in each row. The data frame I want looks like this:

dat 
#   Col1 Col2 Col3 a b c
# 1    a <NA> <NA> 1 0 0
# 2    b    a <NA> 1 1 0
# 3    b    c    b 0 2 1
# 4 <NA> <NA>    a 1 0 0

I used the function

f <- function(x) {
  sum(x == "a", na.rm = TRUE)
}

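to build the columns "a", "b" and "c", but there are many characters to account for, so I am hoping someone can suggest a faster approach. I suspect an apply function could be used, but I have not had any success.

Source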

2014-07-22 mikeL

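You can count each factor level using table. The function is applied to each row using apply. Use factor and its levels argument to also count (possible) factor levels that are not represented in a given row. In a first step, we find all the possible values the data can take.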

levs <- unique(unlist(dat)) 
count <- t(apply(dat, 1, function(x) table(factor(x, levels = levs)))) 
cbind(dat, count) 

#   Col1 Col2 Col3 a b c
# 1    a <NA> <NA> 1 0 0
# 2    b    a <NA> 1 1 0
# 3    b    c    b 0 2 1
# 4 <NA> <NA>    a 1 0 0
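
To illustrate why the levels argument matters, here is a minimal sketch on a single row. The vector x and the hard-coded levs are assumptions made up for this illustration; they are not part of the answer's code.

```r
# One example row with a missing value (illustrative data only)
x <- c("b", "a", NA)

# Without fixed levels, table() reports only the values that occur in x
table(x)

# With fixed levels, values absent from x are still reported, with count 0.
# NA is excluded by default, so it does not get a column of its own.
levs <- c("a", "b", "c")
counts <- table(factor(x, levels = levs))
counts
```

Because every row is tabulated against the same levels, each row yields a vector of the same length, which is what lets apply return a regular matrix.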

Source

2014-07-22 15:44:27 Henrik

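I believe you can do better. You are calling 'table' *and* 'factor' for every row of the dataset! That seems inefficient. – A5C1D2H2I1M1N2O1R2T1

@AnandaMahto, thank you very much for your comment. I completely agree - I am not happy with it either. But this is what I came up with, and rather than throwing it away, I decided to share it ;) – Henrik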

# your data 
Col1<-c("a", "b","b",NA) 
Col2<-c(NA, "a", "c", NA) 
Col3<-c(NA,NA,"b", "a") 

# the data frame. note you don't want the c() function, as you had above 
dat<-data.frame(Col1,Col2,Col3, stringsAsFactors=FALSE)

Solution:

# a vector of all the values we are searching for (less NAs) 
unq_values <- unique(unlist(dat)) 
unq_values <- unq_values[!is.na(unq_values)] 

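# function: for a given unique value, count matches by row
# (grepl() does regex substring matching; that is fine for single letters,
# but exact matching, e.g. sum(x == u, na.rm = TRUE), is safer in general)
freq_vec <- function(u) apply(dat, 1, function(x) sum(grepl(u, x))) 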

# now sapply() that function, and bind to your original data.frame 
cbind(dat, sapply(unq_values, freq_vec))

This produces the result you want:

  Col1 Col2 Col3 a b c
1    a <NA> <NA> 1 0 0
2    b    a <NA> 1 1 0
3    b    c    b 0 2 1
4 <NA> <NA>    a 1 0 0
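
As a possible variation (not part of the answer above), the per-row apply can be replaced by a comparison over the whole data frame followed by rowSums, which also matches values exactly instead of by substring. This is only a sketch; the names vals, counts and result are chosen here for illustration.

```r
# Example data from the question
dat <- data.frame(
  Col1 = c("a", "b", "b", NA),
  Col2 = c(NA, "a", "c", NA),
  Col3 = c(NA, NA, "b", "a"),
  stringsAsFactors = FALSE
)

# All distinct non-missing values
vals <- unique(unlist(dat))
vals <- vals[!is.na(vals)]

# dat == u is a logical matrix; rowSums() counts the TRUE cells per row,
# and na.rm = TRUE makes missing cells count as no match
counts <- sapply(vals, function(u) rowSums(dat == u, na.rm = TRUE))

result <- cbind(dat, counts)
result
```

The loop here runs once per distinct value rather than once per row, which tends to matter when there are many rows and few distinct values.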

Source

2014-07-22 15:44:06 arvi1000

(Edited for simplicity/typos) – arvi1000

You could also do:

library(reshape2)  
cbind(dat,aggregate(value~Var2, melt(t(dat)), FUN=table)[,-1]) 
#  Col1 Col2 Col3 a b c
#1    a <NA> <NA> 1 0 0
#2    b    a <NA> 1 1 0
#3    b    c    b 0 2 1
#4 <NA> <NA>    a 1 0 0

Source

2014-07-22 19:03:42 akrun

I might suggest something like this:

cbind(dat, 
     apply(table(cbind(rn = 1:nrow(dat), 
         stack(lapply(dat, as.character)))), 
      c(1, 2), sum))

This is fairly fast. Here's a Gist with the functions I ran. Here are the results.

fun1 is this answer, fun2 is Henrik's, fun3 is akrun's, and fun4 is arvi1000's.

library(microbenchmark) 
library(reshape2) 
microbenchmark(fun1(), fun2(), fun3(), fun4()) 
# Unit: milliseconds 
# expr  min  lq median  uq  max neval 
# fun1() 1.882373 1.981502 2.031227 2.074144 4.193716 100 
# fun2() 2.201289 2.271821 2.316432 2.346138 5.147774 100 
# fun3() 6.565937 6.821392 6.928942 7.078843 11.700034 100 
# fun4() 2.043613 2.120811 2.151803 2.206342 5.283656 100

Of course, a benchmark on four rows of data does not give a good picture of things, so I scaled it up a bit and tested again:

dat <- do.call(rbind, replicate(5000, dat, FALSE)) 
dim(dat) 
# [1] 20000  3 
system.time(fun1()) 
# user system elapsed 
# 0.657 0.004 0.662 
system.time(fun2()) 
# user system elapsed 
# 7.730 0.029 7.787 
system.time(fun3()) 
# user system elapsed 
# 16.795 0.063 16.887 
system.time(fun4()) 
# user system elapsed 
# 2.128 0.011 2.141
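
For readers who want to see what the stack/table idea is doing, here is a step-by-step sketch. It is a simplified variant, not the exact code of fun1: it builds a two-way table of row number against value directly instead of summing a three-way table, and the names long, counts and result are chosen here for illustration.

```r
# Example data from the question
dat <- data.frame(
  Col1 = c("a", "b", "b", NA),
  Col2 = c(NA, "a", "c", NA),
  Col3 = c(NA, NA, "b", "a"),
  stringsAsFactors = FALSE
)

# Long form: one row per cell, with `values` (cell content) and `ind` (column)
long <- stack(lapply(dat, as.character))

# stack() goes column by column, so the row numbers repeat once per column
long$rn <- rep(seq_len(nrow(dat)), times = ncol(dat))

# Cross-tabulate row number against value; table() ignores NA values by default
counts <- table(long$rn, long$values)

# Convert the table to a plain data frame of counts and bind it to the data
result <- cbind(dat, as.data.frame.matrix(counts))
result
```

The whole data set is tabulated in a single call to table, rather than once per row or per value.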

Source

2014-07-23 17:45:47 A5C1D2H2I1M1N2O1R2T1

Nice answer (+1)! Thanks for the timings! – Henrik
