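In R, use dplyr's mutate() to create a new variable conditional on the contents of another

I want to search the contents of one variable, placement, and create a new variable, term, based on the pattern that is found. A small example follows.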

First, I created a function for the search pattern:

calcterm <- function(x){ # calcterm takes a column argument to read 
    print(x) 
    if (x %in% '_fa_') { 
      return ('fall') 
    } else if (x %in% '_wi_') { 
      return('winter') 
    } else if (x %in% '_sp_') { 
      return('spring') 
    } else {return('summer') 
    } 
}

I'll create a small data frame, which I then pass to dplyr's tbl_df:

placement <- c('pn_ds_ms_fa_th_hrs','pn_ds_ms_wi_th_hrs' ,'pn_ds_ms_wi_th_hrs') 
hours <- c(1230, NA, 34) 

d <- data.frame(placement, hours) 

library(dplyr) 

d <- tbl_df(d)

>d 
    Source: local data frame [3 x 2] 

     placement hours 
      (fctr) (dbl) 
1 pn_ds_ms_fa_th_hrs 1230 
2 pn_ds_ms_wi_th_hrs NA 
3 pn_ds_ms_wi_th_hrs 34

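Next, I use mutate to apply my function to the table d shown above. The goal is to read the contents of placement and create a new variable that takes the value fall, winter, spring, or summer, depending on the pattern found in the placement column.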

d %>% mutate(term=calcterm(placement))

The output leaves me with:

[1] pn_ds_ms_fa_th_hrs pn_ds_ms_wi_th_hrs pn_ds_ms_wi_th_hrs 
Levels: pn_ds_ms_fa_th_hrs pn_ds_ms_wi_th_hrs 
Source: local data frame [3 x 3] 

     placement hours term 
      (fctr) (dbl) (chr) 
1 pn_ds_ms_fa_th_hrs 1230 summer 
2 pn_ds_ms_wi_th_hrs NA summer 
3 pn_ds_ms_wi_th_hrs 34 summer 

Warning messages: 
    1: In if (x %in% "_fa_") { : 
     the condition has length > 1 and only the first element will be used 
    2: In if (x %in% "_wi_") { : 
     the condition has length > 1 and only the first element will be used 
    3: In if (x %in% "_sp_") { : 
     the condition has length > 1 and only the first element will be used

So clearly what I wrote is wrong from the start... maybe %in% could be swapped for a grep-style pattern? I'm not sure how to fix this.

Thanks.

UPDATE

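Based on the responses below, I'm updating this with my full pipe series to show how I'm implementing this. The data I'm working with is "wide"; I first flip its axis and then extract useful information from the column names. This example works, but on my own data, when I get to the mutate() step, I get the message: Error: invalid subscript type 'list'

Notably, after summarise() I get this warning: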

Warning message: 
attributes are not identical across measure variables; they will be dropped

Maybe this is related to the failure at the next step, since the warning doesn't appear in my example?

set.seed(1) 

dfmaker <- function() { 
     setNames(
       data.frame(
         replicate(5, sample(c(NA, 300:500), 4, TRUE), FALSE)), 
       c('pn_ds_ms_fa_th_hrs','rn_ds_ms_wi_th_stu' ,'adn_ds_ms_wi_th_hrs','pn_ds_ms_wi_th_hrs' ,'rn_bsn_ds_ms_wi_th_hrs')) 
} 


d <- dfmaker() 

library(dplyr) 
library(tidyr) # provides gather(), used in the pipeline below 

d <- tbl_df(d) 

grepl_vec_pattern = Vectorize(grepl, 'pattern') 

calcterm = function(s) { 
     require(pryr) 
     s = as.character(s) 
     grepped_patterns = grepl_vec_pattern(s, pattern = c('_sp', '_su', '_fa', '_wi')) 
     stopifnot(any(rowSums(grepped_patterns) == 1)) # Ensure that there is exactly one match 
     reduce_to_colname_with_true = apply(grepped_patterns, 1, compose(names, which)) 
     lut_table = c('_sp' = 'spring', '_su' = 'summer', '_fa' = 'fall', '_wi' = 'winter') 
     lut_table[reduce_to_colname_with_true] 
} 

select(d, matches("^pn_|^adn_|^bsn_"), -starts_with("rn_bsn")) %>% # all the pn, adn, bsn programs, for all information 
     select(contains("_hrs")) %>% # takes out just the hours 
     gather(placement, hours) %>% # flip it! 
     group_by(placement) %>% # gather all the schools into a single observation (replicated placement values at this point) 
     summarise(sumHours = sum(hours, na.rm=T)) %>% 
     mutate(term = calcterm(placement))

Source

2016-02-22 M. Elliott

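`%in%` is for exact matching, not regex. And `mutate` doesn't do anything special here that can't be done in base R, so `dplyr` isn't needed at all for this operation.

You could also do all of this in Excel; that doesn't mean you shouldn't use R. The OP asks a question about `dplyr`: answer the question or don't. This is a perfectly valid use of `dplyr`.

@PaulHiemstra The title of the question is "*In R, use dplyr's mutate() ... etc.*" rather than "*How to find matches ...*" etc. My point is that, to solve this problem, you shouldn't focus on how to use `dplyr::mutate` (a specific tool), because there is nothing special about it, but should instead focus on the problem itself.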

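A simple and very efficient approach is to create simple lookup and pattern vectors and then combine the (very efficient) stringi::stri_detect_fixed with data.table. This solution should scale well, even for huge data sets.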

library(stringi) 
library(data.table) 
Lookup <- c("fall", "winter", "spring") 
Patterns <- c("fa", "wi", "sp") 
setDT(d)[, term := Lookup[stri_detect_fixed(placement, Patterns)], by = placement] 
d[is.na(term), term := "summer"] 
d 
#    placement hours term 
# 1: pn_ds_ms_fa_th_hrs 1230 fall 
# 2: pn_ds_ms_wi_th_hrs NA winter 
# 3: pn_ds_ms_wi_th_hrs 34 winter

If we insist on dplyr, we need to create a helper function to handle the case where no match is found (something data.table handles automatically):

f <- function(x, Lookup, Patterns) { 
    temp <- Lookup[stri_detect_fixed(x[1L], Patterns)] 
    if(!length(temp)) return("summer") 
    temp 
} 

d %>% 
    group_by(placement) %>% 
    mutate(term = f(placement, Lookup, Patterns)) 

# Source: local data frame [3 x 3] 
# Groups: placement [2] 
# 
#   placement hours term 
#    (fctr) (dbl) (chr) 
# 1 pn_ds_ms_fa_th_hrs 1230 fall 
# 2 pn_ds_ms_wi_th_hrs NA winter 
# 3 pn_ds_ms_wi_th_hrs 34 winter
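
For comparison, here is a minimal base-R sketch of the same lookup idea, with no extra packages. It is not part of the answer above and makes no claim about speed. The third placement value is made up so that the "summer" default is exercised, and the underscores around each pattern are an assumption added here to reduce accidental substring hits.

```r
# Sample values; the last one is hypothetical and matches no pattern
placement <- c("pn_ds_ms_fa_th_hrs", "pn_ds_ms_wi_th_hrs", "pn_ds_ms_su_th_hrs")
Lookup    <- c("fall", "winter", "spring")
Patterns  <- c("_fa_", "_wi_", "_sp_")  # underscores are an assumption of this sketch

# Start from the default, then overwrite wherever a pattern is found
term <- rep("summer", length(placement))
for (i in seq_along(Patterns)) {
  term[grepl(Patterns[i], placement, fixed = TRUE)] <- Lookup[i]
}

data.frame(placement, term)
```

If a value matched more than one pattern, the later pattern in Patterns would win, so the order of the vectors matters in this sketch.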

Source

2016-02-22 08:03:47

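A very neat solution. However, you could do the lookup in base R and skip external packages such as `data.table`.

@PaulHiemstra I can't do that, because `grepl` won't accept more than one pattern unless I put the `|` operator between them. In that case I can't use `fixed = TRUE` and lose *a lot* of speed. I use this solution all the time in real life and find `stri_detect_fixed` absolutely outstanding in terms of both convenience and speed.

OK, fair enough. It also felt a bit like you were bashing the use of dplyr :). Again, nice solution, +1.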

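The problem is that you cannot put a logical vector with more than one element in an if statement. R responds by using only the first element of the logical vector and throws the warning message you got.

To solve this, I would use grepl. First, let's create some example data: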
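
A small base-R illustration of that point, using sample values in x rather than the asker's data frame. `%in%` tests whole strings for equality, so a fragment such as '_fa_' never matches, while grepl() searches inside each string and returns one logical per element. ifelse() is the element-wise counterpart of if. Note that since R 4.2.0 a condition of length greater than one in if is an error rather than a warning.

```r
x <- c("pn_ds_ms_fa_th_hrs", "pn_ds_ms_wi_th_hrs")  # sample values for illustration

# %in% compares whole strings, so a fragment never matches
exact_hit <- x %in% "_fa_"

# grepl() looks for the fragment inside each string
is_fall <- grepl("_fa_", x, fixed = TRUE)

# ifelse() evaluates the condition element by element, unlike if ()
term <- ifelse(is_fall, "fall", "other")
term
```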

s = c('bla_wi', 'spam_sp', 'egg_sp', 'ham_fa')

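Next, we need to realize that you cannot pass multiple search patterns to grepl. Fortunately, we can work around that by vectorizing grepl over its pattern argument: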

grepl_vec_pattern = Vectorize(grepl, 'pattern') 
grepped_patterns = grepl_vec_pattern(s, pattern = c('_sp', '_su', '_fa', '_wi')) 
grepped_patterns 
#  _sp _su _fa _wi 
# [1,] FALSE FALSE FALSE TRUE 
# [2,] TRUE FALSE FALSE FALSE 
# [3,] TRUE FALSE FALSE FALSE 
# [4,] FALSE FALSE TRUE FALSE

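Each column of grepped_patterns indicates whether the corresponding pattern matched.

Next, we want to reduce this to a vector that lists, for each element, the pattern that matched (assuming that exactly one pattern matches):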
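
To make the vectorization less of a black box, the sketch below builds the same kind of matrix by hand: one grepl() call per pattern, with the results bound together as columns. The names patterns, manual and vectorised are introduced here for illustration only.

```r
s <- c("bla_wi", "spam_sp", "egg_sp", "ham_fa")
patterns <- c("_sp", "_su", "_fa", "_wi")

# One grepl() call per pattern; each call returns a logical vector over s,
# and sapply() binds those vectors together as the columns of a matrix
manual <- sapply(patterns, function(p) grepl(p, s))

# The Vectorize() wrapper from the answer, applied to the same inputs
vectorised <- Vectorize(grepl, "pattern")(s, pattern = patterns)

# Element-wise comparison of the two results
all(manual == vectorised)
```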

library(pryr) 
reduce_to_colname_with_true = apply(grepped_patterns, 1, compose(names, which)) 
reduce_to_colname_with_true 
# [1] "_wi" "_sp" "_sp" "_fa"

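Note that compose(A, B) is equivalent to A(B()), i.e. a nested function call. I chose compose to avoid using an anonymous function such as function(x) names(which(x)).

Now that we have this information, we need to translate _sp to spring, and so on: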
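
As a quick check of that equivalence without loading pryr, the sketch below defines compose2, a hypothetical two-function stand-in for compose, and applies all three spellings to one row of the match matrix (the named logical vector hit is made up for the example).

```r
# compose2 is a hypothetical stand-in for pryr::compose, limited to two functions
compose2 <- function(f, g) function(x) f(g(x))

# One row of the match matrix: only the "_fa" pattern matched
hit <- c("_sp" = FALSE, "_su" = FALSE, "_fa" = TRUE, "_wi" = FALSE)

nested    <- names(which(hit))                     # plain nested call
composed  <- compose2(names, which)(hit)           # composed call
anonymous <- (function(x) names(which(x)))(hit)    # anonymous-function call

c(nested, composed, anonymous)
```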

lut_table = c('_sp' = 'spring', '_su' = 'summer', '_fa' = 'fall', '_wi' = 'winter') 
lut_table[reduce_to_colname_with_true] 
#  _wi  _sp  _sp  _fa 
# "winter" "spring" "spring" "fall"

And we have the desired result. To make use of this in mutate, we can wrap all of it in a function:

calcterm = function(s) { 
    require(pryr) 
    s = as.character(s) 
    grepped_patterns = grepl_vec_pattern(s, pattern = c('_sp', '_su', '_fa', '_wi')) 
    stopifnot(all(rowSums(grepped_patterns) == 1)) # Ensure that every element matches exactly one pattern 
    reduce_to_colname_with_true = apply(grepped_patterns, 1, compose(names, which)) 
    lut_table = c('_sp' = 'spring', '_su' = 'summer', '_fa' = 'fall', '_wi' = 'winter') 
    lut_table[reduce_to_colname_with_true] 
} 
library(dplyr) 
df = data.frame(s = s) %>% mutate(term = calcterm(s)) 
df 
     s term 
1 bla_wi winter 
2 spam_sp spring 
3 egg_sp spring 
4 ham_fa fall
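
One caveat: when some element matches no pattern, or more than one, apply() returns a list instead of a character vector, and indexing a named vector with a list fails with "invalid subscript type 'list'". The sketch below is one possible way to tolerate such input; calcterm_safe and its default argument are hypothetical names introduced here, not part of the answer's function, and returning NA for ambiguous values is an assumption about the desired behavior.

```r
# calcterm_safe is a hypothetical variant, not the answer's original function
calcterm_safe <- function(s, default = NA_character_) {
  lut_table <- c("_sp" = "spring", "_su" = "summer", "_fa" = "fall", "_wi" = "winter")
  s <- as.character(s)               # also handles factor columns
  term <- rep(default, length(s))
  n_hits <- integer(length(s))       # how many patterns matched each element
  for (p in names(lut_table)) {
    hit <- grepl(p, s, fixed = TRUE)
    term[hit] <- lut_table[[p]]
    n_hits <- n_hits + hit
  }
  # Elements with zero or several matches fall back to the default
  term[n_hits != 1] <- default
  term
}

# "no_match" matches nothing and "ham_fa_wi" matches two patterns
res <- calcterm_safe(c("bla_wi", "spam_sp", "no_match", "ham_fa_wi"))
res
```

Because it always returns a plain character vector of the same length as its input, this variant can be dropped into mutate(term = calcterm_safe(placement)) in the same way as calcterm.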

Source

2016-02-22 07:56:59

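Ah, I forgot about the lut! Yes!! Thank you, this is very helpful! Although I may actually implement the approach @DavidArenburg constructed in some cases (perhaps this one), your explanation was really clear, and I really wanted to see how to do this with the tools I specified. Learning the different approaches and how/why they work helps me make more efficient decisions in the future.

Do I need to put `grepl_vec_pattern = Vectorize(grepl, 'pattern')` inside the calcterm function?

@melliot It isn't required. If you put it outside the function, in the global environment, it will be found.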
