2013-12-19 73 views
1

Extracting strings between braces in R

I need to extract the values that sit between a rather peculiar pair of delimiters in R.

a <- "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098} 
{3:{112:123123214321}}{4:20:asdasd3214213}" 

This is my example string. I want to extract the text between {[0-9]: and }, so that the output for the string above looks like this:

## output should be 
"0987617820" "q312132498s7yd09f8sydf987s6df8797yds9f87098", "{112:123123214321}" "20:asdasd3214213" 
+0

This looks almost like JSON; maybe the "rjson" package will help you? –

+2

This is hard to do with regular expressions because you have a nested structure. –

+0

It is precisely the nested structure that makes this problem harder. – Shreyes

Answers

1

Use Perl-compatible regular expressions (perl=TRUE). This approach is more powerful.

a = "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098}{3:{112:123123214321}}{4:20:asdasd3214213}" 

foohacky = function(str){ 
    # replace each opening "{<digit>:" key (one or more braces) with the marker "@@" 
    pt1 = gsub('\\{+[0-9]:', '@@', str) 
    # remove a closing brace that is preceded by an alphanumeric character 
    pt2 = gsub('([0-9a-zA-Z])(\\})', '\\1', pt1, perl=TRUE) 
    # split on the marker and drop the empty first element 
    pt3 = strsplit(pt2, "@@")[[1]][-1] 
    pt3 
} 

For example:

> foohacky(a) 
[1] "0987617820"         
[2] "q312132498s7yd09f8sydf987s6df8797yds9f87098" 
[3] "{112:123123214321}"       
[4] "20:asdasd3214213" 

It also works with nested items:

> a = "{1:0987617820}{{3:{112:123123214321}}{4:{20:asdasd3214213}}" 
> foohacky(a) 
[1] "0987617820"   "{112:123123214321}" "{20:asdasd3214213}" 
3

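This is a horrible hack and will probably break on your real data. Ideally you would just use a parser, but if you are stuck with regular expressions... well... it's not pretty.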

a <- "{1:0987617820}{2:q312132498s7yd09f8sydf987s6df8797yds9f87098} 
{3:{112:123123214321}}{4:20:asdasd3214213}" 

# split based on }{ allowing for newlines and spaces 
out <- strsplit(a, "\\}[[:space:]]*\\{") 
# Make a single vector 
out <- unlist(out) 
# Have an excess open bracket in first 
out[1] <- substring(out[1], 2) 
# Have an excess closing bracket in last 
n <- length(out) 
out[n] <- substring(out[n], 1, nchar(out[n])-1) 
# Remove the number colon at the beginning of the string 
answer <- gsub("^[0-9]*\\:", "", out) 

This gives:

> answer 
[1] "0987617820"         
[2] "q312132498s7yd09f8sydf987s6df8797yds9f87098" 
[3] "{112:123123214321}"       
[4] "20:asdasd3214213" 

You could wrap something like this in a function if you need to do it for multiple strings.
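A minimal sketch of such a wrapper might look like the following. The name extract_fields is made up for illustration, and the sketch assumes each input string starts with "{" and ends with "}" with no surrounding whitespace.

```r
# Hypothetical helper: the split-and-trim steps above, wrapped for reuse.
# Assumes x is a single string that starts with "{" and ends with "}".
extract_fields <- function(x) {
  # Split on "}{" boundaries, allowing whitespace or newlines in between
  out <- unlist(strsplit(x, "\\}[[:space:]]*\\{"))
  n <- length(out)
  # Drop the leftover opening brace on the first piece
  out[1] <- substring(out[1], 2)
  # Drop the leftover closing brace on the last piece
  out[n] <- substring(out[n], 1, nchar(out[n]) - 1)
  # Strip the leading "<digits>:" key from every piece
  sub("^[0-9]*:", "", out)
}

# Apply to several strings at once; the result is a list with one vector per input
strings <- c("{1:abc}{2:{11:xyz}}", "{1:foo} {2:4:bar}")
lapply(strings, extract_fields)
# first element:  "abc" "{11:xyz}"
# second element: "foo" "4:bar"
```

Returning a list from lapply keeps the fields of each input string separate, which seems safer than flattening everything into one vector.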

+1

Note that this will break if you have multiple items nested within a single item. – Dason
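One possible way around that limitation is to skip the regex split and track brace depth directly. The following is only a sketch and does not come from any of the answers here: the name extract_top_level is hypothetical, and it assumes that braces are balanced and that every top-level item starts with a numeric key followed by a colon.

```r
# Hypothetical helper: walk the string once and keep the contents of each
# top-level {...} item, however many items are nested inside it.
extract_top_level <- function(x) {
  chars <- strsplit(x, "")[[1]]
  depth <- 0L
  start <- NA_integer_
  items <- character(0)
  for (i in seq_along(chars)) {
    if (chars[i] == "{") {
      depth <- depth + 1L
      # Remember where the contents of a top-level item begin
      if (depth == 1L) start <- i + 1L
    } else if (chars[i] == "}") {
      # Closing a top-level item: keep everything between the outer braces
      if (depth == 1L) items <- c(items, substr(x, start, i - 1L))
      depth <- max(depth - 1L, 0L)
    }
  }
  # Remove the leading "<digits>:" key of each top-level item
  sub("^[0-9]+:", "", items)
}

extract_top_level("{1:0987617820}{3:{112:123}{113:456}}{4:20:asd}")
# "0987617820" "{112:123}{113:456}" "20:asd"
```

Characters outside any brace, such as the newline in the original example string, are ignored because they are seen at depth 0.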

1

Here is a more general way. It returns whatever sits between {[0-9]: and }, allowing a single nested {} inside the match.

regPattern <- gregexpr("(?<=\\{[0-9]\\:)(\\{.*\\}|.*?)(?=\\})", a, perl=TRUE) 
a_parse <- regmatches(a, regPattern) 
a <- unlist(a_parse) 
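To make the behaviour easier to check, the same pattern can be wrapped in a small function and run on short inputs. This is only a sketch: the name extract_regex is hypothetical, and the second call illustrates a limitation that appears to follow from the greedy .* in the nested branch.

```r
# Hypothetical wrapper around the pattern above; leaves the input untouched.
extract_regex <- function(x) {
  pattern <- "(?<=\\{[0-9]\\:)(\\{.*\\}|.*?)(?=\\})"
  unlist(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

# One nested item: each field comes back separately
extract_regex("{1:0987617820}{3:{112:123123214321}}{4:20:asdasd3214213}")
# "0987617820" "{112:123123214321}" "20:asdasd3214213"

# Two nested items: the greedy .* runs from the first nested "{" to the
# last "}" that is still followed by another "}"
extract_regex("{1:a}{3:{b}}{4:{c}}")
# "a" "{b}}{4:{c}"
```

Because the lookbehind is fixed at a single digit, keys with two or more digits at the top level would not be matched by this pattern either.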