R: Using vectorization to check whether columns exist in a data frame

I have defined the following function to check whether a data frame contains a number of columns and, if not, to add them.

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5)) 

test <- CheckFullCohorts(test)

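Question: How can I make the hard-coded part of the function (the repeated `df <- foo(...)` calls) more flexible by passing in a vector of column names to check?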

CheckFullCohorts <- function(df) { 
    # Checks if year/cohort df contains all necessary columns 
    # Args: 
    # df: year/cohort df 

    # Return: 
    # df: df, corrected if necessary 

    foo <- function(mydf, mystring) {
        if (!(mystring %in% names(mydf))) {
            mydf[mystring] <- 0
        }
        mydf
    }

    df <- foo(df, "age.16.20") 
    df <- foo(df, "age.21.24") 
    df <- foo(df, "age.25.49") 
    df <- foo(df, "age.50.57") 
    df <- foo(df, "age.58.65") 
    df <- foo(df, "age.66.70") 

    df 
}

I would use the function as shown in the call above.

I have tried:

CheckFullCohorts <- function(df, col.list) { 
    # Checks if year/cohort df contains all necessary columns 
    # Args: 
    # df: year/cohort df 
    # col.list: named list of columns 

    # Return: 
    # df: df, corrected if necessary 

    foo <- function(mydf, mystring) {
        if (!(mystring %in% names(mydf))) {
            mydf[mystring] <- 0
        }
        mydf
    }

    df <- sapply(df, foo, mystring = col.list) 

    df 
}

...but I get the wrong result:

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5)) 
test <- CheckFullCohorts(test, c("age.16.20", "age.20.25")) 

Warning messages: 
1: In if (!(mystring %in% names(mydf))) { : 
    the condition has length > 1 and only the first element will be used 
2: In `[<-.factor`(`*tmp*`, mystring, value = 0) : 
    invalid factor level, NA generated 
3: In if (!(mystring %in% names(mydf))) { : 
    the condition has length > 1 and only the first element will be used 
4: In `[<-.factor`(`*tmp*`, mystring, value = 0) : 
    invalid factor level, NA generated 
> test 
      age.16.20 lorem 
      "x"  "y" 
      "x"  "y" 
      "x"  "y" 
      "x"  "y" 
      "x"  "y" 
age.16.20 NA  NA 
age.20.25 NA  NA

Source

2016-02-16 Timm S.

How about passing a character vector `S` to `CheckFullCohorts` and then replacing the relevant lines with `for (s in S) { df <- foo(df, s) }`?
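The loop suggested in this comment could look roughly like the sketch below. The argument name `cols` is an assumption for illustration; the comment itself only gives the one-line `for` fragment, and `foo` is the helper from the question.

```r
# Loop-based variant: add every missing column one at a time.
CheckFullCohorts <- function(df, cols) {
  # foo() is the helper from the question: add column `mystring`,
  # filled with 0, if it is not already present.
  foo <- function(mydf, mystring) {
    if (!(mystring %in% names(mydf))) {
      mydf[mystring] <- 0
    }
    mydf
  }

  # `cols` is a hypothetical argument name for the character vector
  # of required column names.
  for (s in cols) {
    df <- foo(df, s)
  }
  df
}

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
test <- CheckFullCohorts(test, c("age.16.20", "age.21.24"))
names(test)
```

Each pass through the loop hands `foo` a single column name, so the `if` condition always has length one.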

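Sure, that would work. Does this mean this is one of the cases where a loop is more efficient than a vectorized solution? If so, I would still like to know what I did wrong with `sapply`.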

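Whether the loop is efficient depends on whether the data frame is copied on each iteration, and I don't know whether that is the case. But claims that loops are inefficient are often exaggerated: is this step the bottleneck in your code? If not, it is not where you should spend energy optimizing. As for `sapply`, good question. I tend to use `plyr` for these things, since its interface makes more sense to me. PS: @Roland's answer below works and does not even need a function!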
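On the `sapply` question: `sapply(df, foo, mystring = col.list)` calls `foo` once per column of `df`, so inside `foo` the argument `mydf` is a single column vector rather than the data frame, and the whole `col.list` vector arrives as `mystring`. That is consistent with the warnings in the question (a condition of length > 1 in `if`, and assignment into a factor); recent R versions (4.2.0 and later) raise an error for such a condition instead of a warning. The short check below is only an illustration, and the helper name `show_arg` is made up.

```r
test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))

# show_arg() is a hypothetical helper that reports what sapply() hands
# to the function on each call.
show_arg <- function(x) {
  c(is_data_frame = is.data.frame(x), has_names = !is.null(names(x)))
}

# sapply() iterates over the columns of the data frame, so each call
# receives one column vector, which carries no names of its own.
res <- sapply(test, show_arg)
res
```

Every entry of `res` is `FALSE`: the function never sees the data frame, so `names(mydf)` is `NULL` and the membership test in `foo` cannot find any column.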

You can easily vectorize this like so:

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5)) 
musthaves <- c("age.16.20", "age.21.24", "age.25.49", 
       "age.50.57", "age.58.65", "age.66.70") 

test[musthaves[!(musthaves %in% names(test))]] <- 0 
#   age.16.20 lorem age.21.24 age.25.49 age.50.57 age.58.65 age.66.70
# 1         x     y         0         0         0         0         0
# 2         x     y         0         0         0         0         0
# 3         x     y         0         0         0         0         0
# 4         x     y         0         0         0         0         0
# 5         x     y         0         0         0         0         0

However, `NA` values will usually be more appropriate than `0`.
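If `NA` is the better placeholder, the same assignment works with `NA` on the right-hand side. The sketch below also swaps in `setdiff()` to pick the missing names; that substitution and the variable name `missing_cols` are assumptions made here, not part of the answer.

```r
test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
musthaves <- c("age.16.20", "age.21.24", "age.25.49")

# setdiff() returns the required names that are not yet columns of `test`;
# here it selects the same names as musthaves[!(musthaves %in% names(test))].
missing_cols <- setdiff(musthaves, names(test))

# Create all missing columns in one assignment, filled with NA.
test[missing_cols] <- NA
names(test)
```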

Source

2016-02-16 15:33:53 Roland

`NA` versus `0` is a good point.

Wow, this is really elegant. In general I agree with the `NA` comment, but in this particular case I am looking for `0`.
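Putting the answer back into the shape of the original function, one possible sketch follows. The `fill` argument is an assumption added here so that the same function covers both the `0` and the `NA` case; it does not appear in the original posts.

```r
# Vectorized rewrite of CheckFullCohorts(): no helper function, no loop.
CheckFullCohorts <- function(df, col.list, fill = 0) {
  # Names in col.list that are not yet columns of df.
  missing.cols <- col.list[!(col.list %in% names(df))]

  # Only assign when there is something to add.
  if (length(missing.cols) > 0) {
    df[missing.cols] <- fill
  }
  df
}

test <- data.frame(age.16.20 = rep("x", 5), lorem = rep("y", 5))
test <- CheckFullCohorts(test, c("age.16.20", "age.21.24", "age.25.49"))
str(test)
```

Calling it with `fill = NA` would give the `NA` behaviour recommended in the answer, while the default keeps the `0` the asker wants.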
