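I like to keep things flexible, and I like to keep the intermediate data structures around. So there is almost certainly a shorter, more memory-efficient way to do this than what follows.
Note that I use regular expressions to keep the search flexible (based on your use of the words "similar" and "like"). To demonstrate the effect, I made a few changes to your input data and added some edge cases.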
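As a quick illustration of why the regular expressions help, here is a minimal sketch of how one alternation pattern behaves with base R's grepl. The three sample strings are taken from the data further down; the names samples, pattern and hits exist only for this example.

```r
# A few strings from the example data
samples <- c("4.3 automatic version 1", "vserion auto 5", "2.3 version 1")

# One pattern covering several spellings of the same idea
pattern <- "auto|automated|automatic"

# grepl returns one TRUE/FALSE per input string
hits <- grepl(pattern, samples)
print(hits)
# [1]  TRUE  TRUE FALSE

# ignore.case = TRUE also accepts upper-case variants
upper.hit <- grepl(pattern, "AUTOMATIC", ignore.case = TRUE)
print(upper.hit)
# [1] TRUE
```

Because "auto" is a substring of the other two alternatives, the shorter pattern "auto" would match the same strings; the longer alternatives are spelled out only for readability.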
Another approach could use the tm text-mining package. That would give you more flexibility, at the cost of some extra complexity.
my.table <-
  data.frame(
    col1 = c(
      '4.3 automatic version 1',
      '3.2 manual version 2',
      '2.3 version 1',
      '9.0 version 6',
      'maybe standard',
      'or neither'
    ),
    col2 = c(
      'ite automated version 2',
      'ite version 3',
      '2.5 manual version 2',
      'vserion auto 5',
      'maybe automatic',
      'for reals'
    )
  )
# Each element is a regular expression; its name is the label that ends up in col3
search.terms <- c("auto|automated|automatic", "manual|standard")
names(search.terms) <- c("automatic", "manual")

# For one pattern, return a logical vector: does any column of each row match?
term.test <- function(term) {
  term.pres <- apply(
    my.table,
    MARGIN = 1,
    FUN = function(one.row) {
      any(grepl(pattern = term, x = one.row))
    }
  )
  return(term.pres)
}

# One logical column per search term
term.presence <- lapply(X = search.terms, term.test)
term.presence <- do.call(cbind.data.frame, term.presence)
names(term.presence) <- names(search.terms)
# Turn each logical column into a label column: the term name where it matched, '' elsewhere
as.labels <- lapply(names(search.terms), function(one.term) {
  tempflag <- term.presence[, one.term]
  tempcol <- rep('', length(tempflag))
  tempcol[tempflag] <- one.term
  return(tempcol)
})
as.labels <- do.call(cbind.data.frame, as.labels)
names(as.labels) <- names(search.terms)
# Collapse the non-empty labels of each row into a single string
labels.concat <-
  apply(
    as.labels,
    MARGIN = 1,
    FUN = function(one.row) {
      temp <- unique(sort(one.row))
      temp <- temp[nchar(temp) > 0]
      temp <- paste(temp, collapse = "; ")
      return(temp)
    }
  )
my.table$col3 <- labels.concat
print(my.table)
This gives:
                     col1                    col2              col3
1 4.3 automatic version 1 ite automated version 2         automatic
2    3.2 manual version 2           ite version 3            manual
3           2.3 version 1    2.5 manual version 2            manual
4           9.0 version 6          vserion auto 5         automatic
5          maybe standard         maybe automatic automatic; manual
6              or neither               for reals
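As said at the top, a shorter version is almost certainly possible. One possible sketch, assuming it is acceptable to paste the two columns into a single string per row before matching; row.text, hits and col3 are names introduced here for illustration, and stringsAsFactors = FALSE is added so the columns are plain character vectors in any R version.

```r
my.table <- data.frame(
  col1 = c('4.3 automatic version 1', '3.2 manual version 2', '2.3 version 1',
           '9.0 version 6', 'maybe standard', 'or neither'),
  col2 = c('ite automated version 2', 'ite version 3', '2.5 manual version 2',
           'vserion auto 5', 'maybe automatic', 'for reals'),
  stringsAsFactors = FALSE
)
search.terms <- c(automatic = "auto|automated|automatic",
                  manual = "manual|standard")

# One string per row, so each pattern is tested once per row
row.text <- paste(my.table$col1, my.table$col2)

# Logical matrix: one row per table row, one column per search term
hits <- sapply(search.terms, grepl, x = row.text)

# Join the names of the matching terms for each row ('' when nothing matched)
col3 <- apply(hits, 1, function(flags) {
  paste(colnames(hits)[flags], collapse = "; ")
})
my.table$col3 <- col3
print(my.table)
```

This drops the intermediate term.presence and as.labels data frames. The trade-off is that pasting the columns together could, for other patterns, let a match span the boundary between col1 and col2; with the patterns used here that cannot happen.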
Will column 3 be binary (only 'auto' or 'manual'), or more open-ended (automatic, manual, neither, both, other ...)?