2013-01-24 128 views
1

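R: extracting subexpression occurrences from a regular expression

I want to extract several pieces of data from a string using a single regular expression. I wrote a pattern that includes these pieces as parenthesized subexpressions. In a Perl-like environment I would simply assign the subexpressions to variables with code such as myvar1=$1; myvar2=$2; and so on. How do I do this in R? So far, the only way I have found to access these occurrences is through regexec. That is not very convenient, because regexec does not support Perl syntax, among other reasons. This is what I do now: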

getoccurence <- function(text, rex, n) { # rex is the result of the regexec function
    # element 1 of rex[[1]] is the full match, so group n sits at index n+1
    occstart <- rex[[1]][n+1]
    occstop <- occstart + attr(rex[[1]], 'match.length')[n+1] - 1
    occtext <- substr(text, occstart, occstop)
    return(occtext)
}
mytext <- "junk text, 12.3456, -01.234, valuable text before comma, all the rest" 
mypattern <- "([0-9]+\\.[0-9]+), (-?[0-9]+\\.[0-9]+), (.*)," 
rez <- regexec(mypattern, mytext) 
var1 <- getoccurence(mytext, rez, 1) 
var2 <- getoccurence(mytext, rez, 2) 
var3 <- getoccurence(mytext, rez, 3) 

Obviously this is a rather clumsy solution, and there should be something better. I would appreciate any advice.

Answers

2

Have you looked at regmatches?

> regmatches(mytext, rez) 
[[1]] 
[1] "12.3456, -01.234, valuable text before comma," "12.3456"          
[3] "-01.234"      "valuable text before comma"     

> sapply(regmatches(mytext, rez), function(x) x[4]) 
[1] "valuable text before comma" 

Ouch, indeed! I did read the description of regmatches, of course, but somehow overlooked this :( Thank you very much!!!


P.S. Now I see why: I was trying to use regmatches only after regexpr, not after regexec...
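
For readers who want the Perl-style "one variable per group" workflow, the regexec plus regmatches combination above can be wrapped in a small helper. This is only a minimal sketch and not part of the original answer; the helper name `capture_groups` is hypothetical. It also shows the difference noted in the comment: with regexpr, regmatches returns only the whole match.

```r
# Hypothetical helper (not from the original thread): return only the
# captured groups of the first match of `pattern` in a single string.
capture_groups <- function(text, pattern) {
  m <- regmatches(text, regexec(pattern, text))[[1]]
  m[-1]  # element 1 is the full match; the remaining elements are the groups
}

mytext <- "junk text, 12.3456, -01.234, valuable text before comma, all the rest"
mypattern <- "([0-9]+\\.[0-9]+), (-?[0-9]+\\.[0-9]+), (.*),"

groups <- capture_groups(mytext, mypattern)
var1 <- groups[1]
var2 <- groups[2]
var3 <- groups[3]

# For contrast: after regexpr, regmatches yields the whole match only,
# with no access to the individual groups.
whole <- regmatches(mytext, regexpr(mypattern, mytext))
```

If the pattern does not match, `capture_groups` returns an empty character vector, so the variables become NA rather than raising an error.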

1

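In stringr, this is str_match, or str_match_all if you want to match every occurrence of the pattern in the string. str_match returns a matrix, and str_match_all returns a list of matrices.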

library(stringr) 
str_match(mytext, mypattern) 
str_match_all(mytext, mypattern) 
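
To get from the str_match result to individual variables, the matrix columns can be indexed directly. The sketch below assumes the stringr package is installed and reuses the question's `mytext` and `mypattern`; the variable names `m` and `all_m` are illustrative only.

```r
library(stringr)  # assumes the stringr package is installed

mytext <- "junk text, 12.3456, -01.234, valuable text before comma, all the rest"
mypattern <- "([0-9]+\\.[0-9]+), (-?[0-9]+\\.[0-9]+), (.*),"

# str_match: character matrix with one row per input string.
m <- str_match(mytext, mypattern)
var1 <- m[1, 2]  # column 1 is the full match, so the groups start at column 2
var2 <- m[1, 3]
var3 <- m[1, 4]

# str_match_all: a list with one such matrix per input string,
# each matrix having one row per occurrence of the pattern.
all_m <- str_match_all(mytext, mypattern)
```
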
1

strapply and strapplyc in the gsubfn package can do this in one step:

> strapplyc(mytext, mypattern) 
[[1]] 
[1] "12.3456"     "-01.234"     
[3] "valuable text before comma" 

> # with simplify = c argument 
> strapplyc(mytext, mypattern, simplify = c) 
[1] "12.3456"     "-01.234"     
[3] "valuable text before comma" 

> # extract second element only 
> strapply(mytext, mypattern, ... ~ ..2) 
[[1]] 
[1] "-01.234" 

> # specify function slightly differently and use simplify = c 
> strapply(mytext, mypattern, ... ~ list(...)[2], simplify = c) 
[1] "-01.234" 

> # same 
> strapply(mytext, mypattern, x + y + z ~ y, simplify = c) 
[1] "-01.234" 

> # same but also convert to numeric - also can use with other variations above 
> strapply(mytext, mypattern, ... ~ as.numeric(..2), simplify = c) 
[1] -1.234 

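In the examples above, the third argument can be either a function or a formula that is converted into a function (the LHS gives the arguments and the RHS is the body).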
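
To make the formula-versus-function point concrete, here is a small sketch that writes the `x + y + z ~ y` formula from the answer as an ordinary function. It assumes the gsubfn package is installed; the names `by_formula` and `by_function` are illustrative and do not come from the original answer.

```r
library(gsubfn)  # assumes the gsubfn package is installed

mytext <- "junk text, 12.3456, -01.234, valuable text before comma, all the rest"
mypattern <- "([0-9]+\\.[0-9]+), (-?[0-9]+\\.[0-9]+), (.*),"

# Formula form, as in the answer: x, y and z receive the three captured groups.
by_formula <- strapply(mytext, mypattern, x + y + z ~ y, simplify = c)

# The same extraction with an explicit function taking one argument per group.
by_function <- strapply(mytext, mypattern, function(x, y, z) y, simplify = c)
```

Both calls should pick out the second captured group; the formula is simply a shorter way of writing the function.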