Subset a data frame if a string column contains the $ symbol

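I have a data frame with a time column and a string column. I want to subset this data frame so that I keep only the rows where the string column contains a $ symbol.

After subsetting, I want to clean the string column so that it contains only the characters after the $ symbol, up to the next space or symbol.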

df <- data.frame("time"=c(1:10), 
"string"=c("$ABCD test","test","test $EFG test", 
"$500 test","$HI/ hello","test $JK/", 
"testing/123","$MOO","$abc","123"))

The final output I want is:

Time string 
1  ABCD 
3  EFG 
4  500 
5  HI 
6  JK 
8  MOO 
9  abc

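It keeps only the rows that have a $ in the string column, and then keeps only the characters after the $ symbol, up to a space or symbol.

I have had some success using sub to simply pull out the string, but I haven't been able to apply that to the df and subset it. Thanks for your help.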

Source

2017-03-25 newtoR

We can do this with regexpr/regmatches, extracting only the substring that follows the $:

i1 <- grep("$", df$string, fixed = TRUE) 
transform(df[i1,], string = regmatches(string, regexpr("(?<=[$])\\w+", string, perl = TRUE))) 
#   time string
# 1    1   ABCD
# 3    3    EFG
# 4    4    500
# 5    5     HI
# 6    6     JK
# 8    8    MOO
# 9    9    abc
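To see what the lookbehind pattern does on its own, here is a minimal sketch on a small vector (the names x and m are illustrative and not part of the answer). `(?<=[$])` requires a literal $ immediately before the match without including it, and `\\w+` then takes the run of letters, digits, and underscores that follows. Note that regmatches drops elements with no match, which is presumably why the rows are subset with i1 first.

```r
# Illustrative input: three values contain "$", one does not
x <- c("$ABCD test", "test", "test $EFG test", "$HI/ hello")

# regexpr returns the start position of the first match in each element,
# or -1 where the pattern does not match
m <- regexpr("(?<=[$])\\w+", x, perl = TRUE)
as.integer(m)

# regmatches returns only the matched substrings, so the result is
# shorter than x whenever some element has no match
regmatches(x, m)
```

One caveat under this approach: a value that contains $ but has no word character directly after it (for example "$ test") passes the grep filter yet yields no match, so the lengths would no longer line up inside transform.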


Or with tidyverse syntax:

library(tidyverse) 
df %>% 
    filter(str_detect(string, fixed("$"))) %>% 
    mutate(string = str_extract(string, "(?<=[$])\\w+"))

Source

2017-03-26 04:23:44 akrun

Until someone comes up with a pretty regex solution, here is my take:

# subset for $ signs and convert to character class 
res <- df[ grepl("$", df$string, fixed = TRUE),] 
res$string <- as.character(res$string) 

# split on non alpha and non $, and grab the one with $, then remove $ 
res$clean <- sapply(strsplit(res$string, split = "[^a-zA-Z0-9$']", perl = TRUE), 
        function(i){ 
         x <- i[grepl("$", i, fixed = TRUE)] 
         # in case when there is more than one $ 
         # x <- i[grepl("$", i, fixed = TRUE)][1] 
         gsub("$", "", x, fixed = TRUE) 
        }) 
res 
#   time         string clean
# 1    1     $ABCD test  ABCD
# 3    3 test $EFG test   EFG
# 4    4      $500 test   500
# 5    5     $HI/ hello    HI
# 6    6      test $JK/    JK
# 8    8           $MOO   MOO
# 9    9           $abc   abc

Source

2017-03-25 22:23:35 zx8754

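This is great, thank you. One thing I ran into when running it on my dataset, which I hadn't foreseen: some strings actually contain multiple occurrences of '$string'. For example, one value might be '$ABCD test $EBC and $FB', which produces the value 'c("ABCD", "EBC", "FB")'. Is it possible to store only the first occurrence? Thanks! – newtoR

@newtoR Use this line to get only the first occurrence: 'x <- i[grepl("$", i, fixed = TRUE)][1]'. I have added it as a comment in the post. – zx8754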
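As a minimal sketch of the first-occurrence variant discussed in these comments: the helper name first_dollar_token and the sample vector s are hypothetical, and vapply is swapped in for sapply so that each row is guaranteed to yield exactly one string. The last line shows that the regexpr-based approach should already give the same result, since regexpr only reports the first match in each element.

```r
# Illustrative input: the first value has several "$" tokens
s <- c("$ABCD test $EBC and $FB", "test $EFG test")

# Hypothetical helper: keep the first piece containing "$", then drop the "$"
first_dollar_token <- function(parts) {
  x <- parts[grepl("$", parts, fixed = TRUE)][1]
  gsub("$", "", x, fixed = TRUE)
}

# Split on anything that is not alphanumeric, "$" or "'", as in the answer,
# then reduce each split result to a single string
pieces <- strsplit(s, split = "[^a-zA-Z0-9$']", perl = TRUE)
clean <- vapply(pieces, first_dollar_token, character(1))
clean

# regexpr matches only the first occurrence per element
first_match <- regmatches(s, regexpr("(?<=[$])\\w+", s, perl = TRUE))
first_match
```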
