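R: use vectorization to check whether columns exist in a data frame

I have defined the function CheckFullCohorts (shown below) to check whether a data frame contains a number of columns and, if not, to add them. I call it like this: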
test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
test <- CheckFullCohorts(test)
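Question: how can I make the hard-coded part of the function (the repeated df <- foo(...) calls) more flexible by passing in a vector of column names to check?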
CheckFullCohorts <- function(df) {
  # Checks if year/cohort df contains all necessary columns
  # Args:
  #   df: year/cohort df
  # Return:
  #   df: df, corrected if necessary
  foo <- function(mydf, mystring) {
    if (!(mystring %in% names(mydf))) {
      mydf[mystring] <- 0
    }
    mydf
  }
  df <- foo(df, "age.16.20")
  df <- foo(df, "age.21.24")
  df <- foo(df, "age.25.49")
  df <- foo(df, "age.50.57")
  df <- foo(df, "age.58.65")
  df <- foo(df, "age.66.70")
  df
}
The call shown at the top is how I use this function. To make it more flexible, I have tried:
CheckFullCohorts <- function(df, col.list) {
  # Checks if year/cohort df contains all necessary columns
  # Args:
  #   df: year/cohort df
  #   col.list: named list of columns
  # Return:
  #   df: df, corrected if necessary
  foo <- function(mydf, mystring) {
    if (!(mystring %in% names(mydf))) {
      mydf[mystring] <- 0
    }
    mydf
  }
  df <- sapply(df, foo, mystring = col.list)
  df
}
...but I get the wrong result:
test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
test <- CheckFullCohorts(test, c("age.16.20", "age.20.25"))
Warning messages:
1: In if (!(mystring %in% names(mydf))) { :
  the condition has length > 1 and only the first element will be used
2: In `[<-.factor`(`*tmp*`, mystring, value = 0) :
  invalid factor level, NA generated
3: In if (!(mystring %in% names(mydf))) { :
  the condition has length > 1 and only the first element will be used
4: In `[<-.factor`(`*tmp*`, mystring, value = 0) :
  invalid factor level, NA generated
> test
          age.16.20 lorem
          "x"       "y"
          "x"       "y"
          "x"       "y"
          "x"       "y"
          "x"       "y"
age.16.20 NA        NA
age.20.25 NA        NA
How about passing a vector of strings 'S' to 'CheckFullCohorts' and then replacing the relevant lines with 'for (s in S) {df <- foo(df, s)}'?
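For readers who want to see that loop suggestion spelled out, here is a minimal sketch. The argument name cols is an assumption of mine, not something from the original post; the helper foo is copied unchanged from the question.

```r
# Sketch of the loop-based variant suggested in the comment above.
# `cols` is a hypothetical argument name: a character vector of required columns.
CheckFullCohorts <- function(df, cols) {
  foo <- function(mydf, mystring) {
    # Add the column, filled with 0, only if it is not already present
    if (!(mystring %in% names(mydf))) {
      mydf[mystring] <- 0
    }
    mydf
  }
  # Call foo once per required column name
  for (s in cols) {
    df <- foo(df, s)
  }
  df
}

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
test <- CheckFullCohorts(test, c("age.16.20", "age.20.25"))
names(test)
```

Here the existing column age.16.20 is left alone and age.20.25 is appended as a column of zeros.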
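Sure, that would work. Does that mean this is one of the cases where a loop is more efficient than a vectorized solution? If so, I would still like to know what I did wrong with 'sapply'.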
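On the 'sapply' question, one likely explanation (my reading of the warnings above, not something stated in the thread) is that sapply iterates over the columns of a data frame, so foo receives one column at a time as mydf and the whole vector of names as mystring. The short sketch below only inspects what each call receives; the variable names are hypothetical.

```r
test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))

# sapply() treats a data frame as a list of columns, so the function is
# called once per column and never sees the data frame itself.
received.df <- sapply(test, function(x) is.data.frame(x))

# A plain column has no names, so `mystring %in% names(mydf)` compares
# against NULL inside foo.
no.names <- sapply(test, function(x) is.null(names(x)))

# With two column names passed at once, the if() condition has length 2,
# which matches the "condition has length > 1" warning in the question.
cond <- !(c("age.16.20", "age.20.25") %in% NULL)
length(cond)
```

Because the condition is taken as TRUE, foo then assigns 0 to elements named age.16.20 and age.20.25 of each column, which would account for the two extra NA rows in the matrix that sapply returns. As far as I know, R 4.2.0 and later raise an error for a length > 1 condition instead of a warning.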
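Whether the loop is efficient depends on whether the data frame is copied on each iteration, and I don't know if that is the case. But the talk about loops being inefficient is often exaggerated: is this step the bottleneck in your code? If not, it is not where you should spend energy optimizing. As for 'sapply', good question. I tend to use 'plyr' for these things, since its interface makes more sense to me. PS: @Roland's answer below also works and doesn't need a function!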
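For illustration, one function-free way to do this is to compute the missing names with setdiff and assign them in a single step. This is a sketch under my own assumptions and not necessarily the answer referred to above; the names wanted and missing.cols are hypothetical.

```r
test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))

# Hypothetical vector of required column names
wanted <- c("age.16.20", "age.20.25")

# Names that are required but not yet present in the data frame
missing.cols <- setdiff(wanted, names(test))

# Add all missing columns at once, filled with 0; skip if nothing is missing
if (length(missing.cols) > 0) {
  test[missing.cols] <- 0
}
names(test)
```

The check happens once on the whole vector of names, so no explicit loop or helper function is needed.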