2015-09-26

R regex for extracting a complex string pattern

I have a messy character vector, as shown below.

string <- c("GRP-14994/", "GRP-7056 GRP-7036/", "grp-24263(24263)/IRGC 28588", "GRP-15916 /IRGC-42176", 
      "GRP-614-250B/", "(GRP 11432)/IRGC-14570", "Tourn", "GRPP256", "Purse", "GRP-14956 Origin:", "GRP 10537", "GRP-10096 Origin: ", 
      "SGRP123", "GRP1234", "AC-30009 (GRPHANA)/", "AC-3060 GRP 536-143/Old AC", "RGRPfaa/23", "/-", 
      "MGR:7251/", "1216-GR-567/", "X:1 Well KGRPh", "WabGRPvea(II)", "HR33(BGRP)", "Tensor", 
      "Wald", "grp12312") 

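I am trying to extract every instance where GRP is followed by digits, possibly separated by a space or a "-".

My current attempt gives me the following result.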

gsub("(.*)(\\b)(GRP)(-|\\s|)(\\d+)(\\/|\\b)(.*)","\\3\\5", string, ignore.case = T) 
[1] "GRP14994"   "GRP7056"    "grp24263"   "GRP15916"   
[5] "GRP614"    "GRP11432"   "Tourn"    "GRPP256"    
[9] "Purse"    "GRP14956"   "GRP10537"   "GRP10096"   
[13] "SGRP123"    "GRP1234"    "AC-30009 (GRPHANA)/" "GRP536"    
[17] "RGRPfaa/23"   "/-"     "MGR:7251/"   "1216-GR-567/"  
[21] "X:1 Well KGRPh"  "WabGRPvea(II)"  "HR33(BGRP)"   "Tensor"    
[25] "Wald"    "grp12312"  

But the desired output is

out <- c("GRP14994", "GRP7056 GRP7036", "grp24263", "GRP15916", "GRP614250", 
"GRP11432", "", "", "", "GRP14956", "GRP10537", "GRP10096", "", 
"GRP1234", "", "GRP536143", "", "", "", "", "", "", "", "", "", 
"grp12312") 

out 
[1] "GRP14994"  "GRP7056 GRP7036" "grp24263"  "GRP15916"  "GRP614250"  "GRP11432"  
[7] ""    ""    ""    "GRP14956"  "GRP10537"  "GRP10096"  
[13] ""    "GRP1234"   ""    "GRP536143"  ""    ""    
[19] ""    ""    ""    ""    ""    ""    
[25] ""    "grp12312"  

How can I modify the regex to get the desired result?

+1

The expected output you provided looks incorrect. Wouldn't 'GRP614' be 'GRP614250'? And 'GRPP256'? It has two **P**s – hwnd

+1

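If this is your input, and you are sure of your input data, you can force the string to start with the given GRP string by beginning your regex with ^ and dropping the leading (.*), so that it only matches strings that start with GRP – LMG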

+1

GRPP256' ... – hwnd

Answers

1
library(stringr)  # provides str_extract_all()
unlist(lapply(str_extract_all(string,"[Gg][rR][pP][-\\s]?\\d+"), function (x) { gsub("[-\\s]+(\\d)", "\\1", paste(x, collapse= " "),perl=T) })) 
[1] "GRP14994"  "GRP7056 GRP7036" "grp24263"  
[4] "GRP15916"  "GRP614"   "GRP11432"  
[7] ""    ""    ""    
[10] "GRP14956"  "GRP10537"  "GRP10096"  
[13] "GRP123"   "GRP1234"   ""    
[16] "GRP536"   ""    ""    
[19] ""    ""    ""    
[22] ""    ""    ""    
[25] ""    "grp12312" 
+1

This does not give the correct output. – hwnd

+0

Where?????? //// –

+0

See for yourself. – hwnd

1

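Faults found in your pattern

(.*)(\\b)(GRP)(-|\\s|)(\\d+)(\\/|\\b)(.*)","\\3\\5 

- You want to capture something like GRP-668-888, but your pattern only allows for one hyphen followed by digits, i.e. GRP-668.

- Since you are not using the other words, the greedy (.*) expressions before and after your pattern are not needed. You can rely on GRP itself, because the part you want always starts with GRP.

- The \\b word boundary before (GRP) is not needed in your pattern either.

These are the important ones I can detect for now.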
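The first point can be checked on two of the inputs from the question. This is only a sketch: `old_core` is the GRP/separator/digits part of the question's pattern, and `new_core` is a hypothetical variant (not from the question or this answer) that also allows repeated "-digits" groups.

```r
# Two inputs from the question that contain a second "-digits" group.
x <- c("GRP-614-250B/", "AC-3060 GRP 536-143/Old AC")

# Core of the question's pattern: GRP, optional separator, one run of digits.
old_core <- "GRP(-|\\s|)\\d+"
# Hypothetical variant: additionally allow any number of "-digits" groups.
new_core <- "GRP(-|\\s|)\\d+(-\\d+)*"

# regexpr() locates the first match in each element; regmatches() extracts it.
old_hits <- regmatches(x, regexpr(old_core, x, ignore.case = TRUE, perl = TRUE))
new_hits <- regmatches(x, regexpr(new_core, x, ignore.case = TRUE, perl = TRUE))

print(old_hits)  # the match stops after the first run of digits
print(new_hits)  # the match also covers the second "-digits" group
```

The separators are still part of each match here; they would have to be stripped in a further step.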

You can try the following

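# Backslashes are doubled for R string literals; perl = TRUE so that \\s is honoured inside [...]
gsub("(grp)[-\\s]?(\\d+)[-\\s]?(\\d+)","\\1\\2\\3", string, ignore.case = T, perl = T) 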

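grp: matches the literal text GRP wherever it occurs in the string (case-insensitively, because of ignore.case)

[-\s]?: matches a hyphen - or a whitespace character \s; the ? makes it optional

(\d+): captures one or more digits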
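As a rough base-R sketch of how these pieces could be combined to reproduce the `out` vector from the question: instead of rewriting the whole string with gsub, it extracts every match and then strips the separators. The names `pattern`, `hits` and `res` are illustrative. The pattern is an assumption that departs from the one above in two ways: it keeps `\\b` before GRP, because the desired output leaves "SGRP123" empty, and it allows repeated "-digits" groups.

```r
string <- c("GRP-14994/", "GRP-7056 GRP-7036/", "grp-24263(24263)/IRGC 28588",
            "GRP-15916 /IRGC-42176", "GRP-614-250B/", "(GRP 11432)/IRGC-14570",
            "Tourn", "GRPP256", "Purse", "GRP-14956 Origin:", "GRP 10537",
            "GRP-10096 Origin: ", "SGRP123", "GRP1234", "AC-30009 (GRPHANA)/",
            "AC-3060 GRP 536-143/Old AC", "RGRPfaa/23", "/-", "MGR:7251/",
            "1216-GR-567/", "X:1 Well KGRPh", "WabGRPvea(II)", "HR33(BGRP)",
            "Tensor", "Wald", "grp12312")

# Desired output, copied from the question.
out <- c("GRP14994", "GRP7056 GRP7036", "grp24263", "GRP15916", "GRP614250",
         "GRP11432", "", "", "", "GRP14956", "GRP10537", "GRP10096", "",
         "GRP1234", "", "GRP536143", "", "", "", "", "", "", "", "", "",
         "grp12312")

# Word boundary, GRP, optional "-" or whitespace, digits,
# then any number of further "-digits" groups.
pattern <- "\\bGRP[-\\s]?\\d+(?:-\\d+)*"

# gregexpr() finds every match per element; regmatches() returns them as a list.
hits <- regmatches(string,
                   gregexpr(pattern, string, ignore.case = TRUE, perl = TRUE))

# Strip the separators inside each match, then join multiple matches with a
# space. Elements without a match become "".
res <- vapply(hits,
              function(x) paste(gsub("[-\\s]", "", x, perl = TRUE), collapse = " "),
              character(1), USE.NAMES = FALSE)

print(identical(res, out))
```

Both calls use perl = TRUE so that `\\s` inside the bracket expression is read as whitespace.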
