Finding the position of a character in a string using dplyr mutate

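I have a data frame with a column of strings: a numeric identifier followed by " - " and then a month and year. I am trying to parse the strings to extract the month and year. As a first step, I am using dplyr::mutate() and regexpr(), as in

regexpr("-",yearid)[1]

to create a new column holding the position of the "-" character. But regexpr() seems to behave quite differently inside mutate() than when used on its own. It does not seem to update for each string, and instead carries over the position from the earlier rows. In the example below, I would expect the position of the "-" character to be 4, 4 and 5 respectively. Instead I get 4, 4 and 4, so the last 4 is incorrect. When I run regexpr() on each string separately, I do not see this problem.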

Am I missing something, and how can I get the position of "-" computed dynamically for each value of yearid? There may also be a simpler way to get the month and the year.

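library(dplyr)   # mutate()
library(stringr) # str_trim()

yearid <- c("50 - January 1995","51 - January 1996","100 - January 1997")
data.df <- data.frame(yearid)
# Reproduces the problem: the [1] keeps only the first position
data.df <- mutate(data.df, trimpos = regexpr("-",str_trim(yearid))[1],
       pos = regexpr("-",yearid)[1])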

> data.df 
              yearid trimpos pos
1  50 - January 1995       4   4
2  51 - January 1996       4   4
3 100 - January 1997       4   4

Using regexpr() on its own, on the other hand, I get the output I expect:

> regexpr("-",yearid[1])[1] 
[1] 4 
> regexpr("-",yearid[2])[1] 
[1] 4 
> regexpr("-",yearid[3])[1] 
[1] 5

Finally, my sessionInfo() is below:

R version 3.1.1 (2014-07-10) 
Platform: x86_64-apple-darwin10.8.0 (64-bit) 

locale: 
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 

attached base packages: 
[1] stats  graphics grDevices utils  datasets methods base  

other attached packages: 
[1] stringr_1.0.0 dplyr_0.4.1  readr_0.2.2.9000 

loaded via a namespace (and not attached): 
[1] assertthat_0.1  DBI_0.3.1   knitr_1.10.5    lazyeval_0.1.10.9000 magrittr_1.5   parallel_3.1.1  
[7] Rcpp_0.11.6   stringi_0.4-1  tools_3.1.1

Source

2016-05-11 rajvijay

Just drop the '[1]'s from your 'mutate' expression. – nrussell

Using dplyr is rather pointless when there is no grouping and 'regexpr()' is already vectorized. –

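@nrussell Thanks. In case it helps, any idea what it is about the [1] on regexpr that causes the problem I noticed? I just want to make sure I understand the underlying issue too. – rajvijay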

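The regexpr function (which comes from base R, not from the stringr library) returns a vector of positions with additional attributes, including match.length and useBytes. As mentioned in the comments, this vector can be assigned directly to the data frame. This can be done with or without the mutate function.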

library(dplyr) 
library(stringr) 

id_month_year <- c(
    "50 - January 1995", 
    "51 - January 1996", 
    "100 - January 1997" 
) 
data <- data.frame(id_month_year, another_column = 1) 

## create new column using mutate (columns can be referenced by name inside mutate)
data <- data %>% mutate(pos1 = regexpr("-", id_month_year))

## create new column without mutate 
data$pos2 <- regexpr("-", data$id_month_year) 

print(data)

Here are the new columns:

       id_month_year another_column pos1 pos2
1  50 - January 1995              1    4    4
2  51 - January 1996              1    4    4
3 100 - January 1997              1    5    5
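
As for why the [1] in the question produced 4 in every row, here is a minimal base-R sketch, assuming the same yearid vector as in the question (the names all_pos, first_pos and recycled are only illustrative). regexpr() already returns one position per string; subsetting with [1] keeps only the first of them, and a length-one value is then recycled across all rows when it becomes a column.

```r
yearid <- c("50 - January 1995", "51 - January 1996", "100 - January 1997")

# regexpr() is vectorized: one match position per input string.
# as.integer() drops the match.length/useBytes attributes.
all_pos <- as.integer(regexpr("-", yearid, fixed = TRUE))
all_pos
# [1] 4 4 5

# [1] keeps only the position found in the first string
first_pos <- all_pos[1]
first_pos
# [1] 4

# A length-one value is recycled to every row when used as a column
recycled <- data.frame(yearid, pos = first_pos)
recycled$pos
# [1] 4 4 4
```

So the 4 in the third row is not a wrong match; it is the first row's position repeated.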

I would suggest using the separate function from the tidyr library. Here is a sample code snippet:

library(dplyr) 
library(tidyr) 

id_month_year <- c(
    "50 - January 1995", 
    "51 - January 1996", 
    "100 - January 1997" 
) 
data <- tbl_df(data.frame(id_month_year, another_column = 1)) 

clean <- data %>% 
    separate(
     id_month_year, 
     into = c("id", "month", "year"), 
     sep = "[- ]+", 
     convert = TRUE 
    ) 

print(clean)

And here is the resulting clean data frame:

Source: local data frame [3 x 4]

     id   month  year another_column
  (int)   (chr) (int)          (dbl)
1    50 January  1995              1
2    51 January  1996              1
3   100 January  1997              1
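
If tidyr is not an option, the position-based approach from the question can probably be finished in base R as well. This is only a sketch, assuming every string has the form "id - Month Year"; the names pos, rest, parts and parsed are illustrative and not part of the original answer.

```r
yearid <- c("50 - January 1995", "51 - January 1996", "100 - January 1997")

# Position of the first "-" in each string (vectorized, no [1])
pos <- as.integer(regexpr("-", yearid, fixed = TRUE))

# Text before the dash is the id; text after it is "Month Year"
id   <- as.integer(trimws(substr(yearid, 1, pos - 1)))
rest <- trimws(substring(yearid, pos + 1))

# Split "Month Year" on the single space between the two parts
parts <- strsplit(rest, " ", fixed = TRUE)

parsed <- data.frame(
  id    = id,
  month = vapply(parts, `[`, character(1), 1),
  year  = as.integer(vapply(parts, `[`, character(1), 2)),
  stringsAsFactors = FALSE
)
parsed
```

This keeps the per-row dash position that the question was after, at the cost of more code than the separate call.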

Source

2016-05-11 18:18:24 Andrew
