2015-10-21 35 views

R: index only the first occurrence of one pattern after another pattern

I have a character vector like this (part of a larger one):

a <- c("My string", 
     "characters", 
     "sentence", 
     "text.", 
     "My string word sentence word.", 
     "Other thing word sentence characters.", 
     "My string word sentence numbers.", 
     "Other thing", 
     "word.", 
     "sentence", 
     "text.", 
     "Other thing word. characters sentence.", 
     "Different string word text.", 
     "Different string.", 
     "word.", 
     "sentence.", 
     "My string", 
     "word", 
     "sentence", 
     "things.", 
     "My string word sentence blah.") 

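As you can see, the vector contains several expressions. Some of them sit in a single element, others are split across multiple elements (which is fine). Also note that some of them contain more than one period, whether in a single string or across split strings. What I want is to extract the expressions that start with My string and end either with the period in the same element (if the whole expression is in a single string) or at the end of the last element of the expression that started with My string.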

The way I imagine doing this is to first index all elements that contain My string:

> b <- grep(pattern = "My string", x = a, fixed = TRUE) 
> b 
[1] 1 5 7 17 21 

Then index all elements that end with a period:

> c <- grep(pattern = "\\.$", x = a) 
> c 
[1] 4 5 6 7 9 11 12 13 14 15 16 20 21 

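Finally, get only the position of the first period after each expression that starts with My string (whether it sits in a single element or spans several). With those positions it would be easy to subset just the elements I need and get something like this: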

d <- c("My string", 
     "characters", 
     "sentence", 
     "text.", 
     "My string word sentence word.", 
     "My string word sentence numbers.", 
     "My string", 
     "word", 
     "sentence", 
     "things.", 
     "My string word sentence blah.") 

Can someone help with the last step, that is, getting only the position of the first period after each expression that starts with My string?

Answers

1

I think something like this will do what you want:

b <- grep(pattern = "My string", x = a, fixed = TRUE) 
c <- grep(pattern = "\\.$", x = a) 

# find first period for each start string 
e <- sapply(b, function(x) head(c[c>=x],1)) 

# extract ranges 
d <- a[unlist(Map(`:`, b,e))] 

# [1] "My string"      
# [2] "characters"      
# [3] "sentence"       
# [4] "text."       
# [5] "My string word sentence word." 
# [6] "My string word sentence numbers." 
# [7] "My string"      
# [8] "word"        
# [9] "sentence"       
# [10] "things."       
# [11] "My string word sentence blah." 
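
One caveat with the sapply() step: if some My string element had no period at or after it, head() would return a zero-length vector for that start and sapply() would return a list instead of an integer vector. The following is only a minimal sketch of one way to guard against that, assuming such unfinished expressions should simply be dropped. The helper name first_period_after and the small vector x are hypothetical and not part of the question.

```r
# Hypothetical helper: for each start index, return the first end index
# that is >= the start, or NA when no such end exists.
first_period_after <- function(starts, ends) {
  vapply(starts, function(s) {
    hit <- ends[ends >= s]
    if (length(hit) > 0) hit[1] else NA_integer_
  }, integer(1))
}

# Small illustrative vector; the last "My string" is never closed by a period.
x <- c("My string", "text.", "Other thing.",
       "My string word.", "My string", "unfinished")

starts <- grep("My string", x, fixed = TRUE)
ends <- grep("\\.$", x)

first_end <- first_period_after(starts, ends)
keep <- !is.na(first_end)  # drop starts that have no closing period

d <- x[unlist(Map(`:`, starts[keep], first_end[keep]))]
d
```

vapply() with integer(1) always returns an integer vector, so the Map() call afterwards receives inputs of a predictable type.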

Yes, this is exactly what I needed. Thank you very much! – panman

2

Here is something that uses dplyr:

library(dplyr) 

a <- c("My string", 
     "characters", 
     "sentence", 
     "text.", 
     "My string word sentence word.", 
     "Other thing word sentence characters.", 
     "My string word sentence numbers.", 
     "Other thing", 
     "word.", 
     "sentence", 
     "text.", 
     "Other thing word. characters sentence.", 
     "Different string word text.", 
     "Different string.", 
     "word.", 
     "sentence.", 
     "My string", 
     "word", 
     "sentence", 
     "things.", 
     "My string word sentence blah.") 

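data.frame(a = a, stringsAsFactors = FALSE) %>% 
  # flag elements that contain a period, i.e. the end of a sentence 
  mutate(period = grepl("[.]", a), 
         # count the sentence ends seen before the current element 
         sentence_id = lag(cumsum(period), default = 0)) %>% 
  group_by(sentence_id) %>% 
  # keep a sentence if any of its elements contains "My string" 
  mutate(retain = any(grepl("My string", a))) %>% 
  ungroup() %>% 
  filter(retain) 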

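Another way to approach this is to identify the elements that contain a period and use them to mark where a new sentence starts. This gives us a sentence_id to group by, and then we only need to look for the string "My string" within each group.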
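
The same grouping idea can be sketched in base R without dplyr. This is only an illustration on a shortened, hypothetical vector x, not the code from the answer. Like the dplyr version, it treats any element containing a period as the end of a sentence and keeps every sentence in which "My string" appears anywhere.

```r
# Shortened illustrative vector (not the full vector from the question).
x <- c("My string", "word", "things.",
       "Other thing", "word.",
       "My string word sentence blah.")

# TRUE where an element contains a period (end of a sentence).
period <- grepl("[.]", x)

# Number of sentence ends seen before each element; this is the
# base R counterpart of lag(cumsum(period), default = 0).
sentence_id <- c(0, head(cumsum(period), -1))

# Keep every element whose sentence contains "My string".
has_start <- grepl("My string", x, fixed = TRUE)
retain <- sentence_id %in% sentence_id[has_start]

x[retain]
```

Shifting the cumulative sum by one position makes the element that carries the period belong to the sentence it closes rather than to the next one.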


Thank you very much! – panman
