2017-06-29

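Regex, R, and commas

I'm having some trouble with a regular expression in R. I'm trying to use a regex to extract the tags from a string (scraped from the web), as follows: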

str <- "\n\n\n \n\n\n  「Don't cry because it's over, smile because it happened.」\n ―\n Dr. Seuss\n\n\n\n\n \n  tags:\n  attributed-no-source,\n  cry,\n  crying,\n  experience,\n  happiness,\n  joy,\n  life,\n  misattributed-dr-seuss,\n  optimism,\n  sadness,\n  smile,\n  smiling\n \n \n  176513 likes\n \n\n\n\n\nLike\n\n" 

# Why doesn't this work at all? 
stringr::str_match(str, "tags:(.+)\\d") 

    [,1] [,2] 
[1,] NA NA 

# Why just the first tag? What happens at the comma? 
stringr::str_match(str, "tags:\n(.+)") 

     [,1]         [,2]       
[1,] "tags:\n  attributed-no-source," "  attributed-no-source," 

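So, two questions: why doesn't my first idea work at all, and why does the second one capture only up to the first comma instead of through the end of the string?

Thanks!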

It would be helpful if you explained what your expected result is. – Dason

Did you mean 'str_match(str, "tags:[^0-9]*[0-9]*")' for the first case? – akrun

Answers

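Note that the stringr regex flavor is ICU. Unlike in TRE (the default engine in base R), . does not match line break characters in an ICU regex pattern. That is why the first pattern returns NA and the second stops at the end of the first tag line.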
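A minimal sketch of that difference, using only base R so it runs without extra packages. The default engine (TRE) lets . cross a line break, while PCRE (perl = TRUE) does not unless (?s) is set; ICU treats line breaks the same way PCRE does here. The sample string x is made up for illustration.

```r
x <- "tags:\n  cry"   # hypothetical two-line sample, not the question's data

# TRE (base R default): . also matches the newline
tre <- grepl("tags:.+cry", x)

# PCRE: . stops at the newline by default
pcre <- grepl("tags:.+cry", x, perl = TRUE)

# PCRE with the DOTALL flag: . matches the newline as well
pcre_dotall <- grepl("(?s)tags:.+cry", x, perl = TRUE)

print(c(tre = tre, pcre = pcre, pcre_dotall = pcre_dotall))
```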

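So, one possible fix is to put (?s) at the start of your pattern. It is the DOTALL modifier, which makes . match any character, including line break characters: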

str_match(str, "(?s)tags:(.+)\\d") 

str_match(str, "(?s)tags:\n(.+)") 
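To see what these two patterns capture once . can cross line breaks, here is a small sketch on a shortened, made-up string s. It uses base R with perl = TRUE as a stand-in for stringr (an assumption made only to avoid a package dependency; for these simple patterns the two engines should agree).

```r
# Shortened, hypothetical version of the scraped text
s <- "quote\n  tags:\n  cry,\n  joy\n \n  176513 likes\n"

# Helper (hypothetical name): return capture group 1 of the first match
first_group <- function(pattern, text) {
  regmatches(text, regexec(pattern, text, perl = TRUE))[[1]][2]
}

# Greedy .+ runs to the end, then backtracks to the LAST digit in the string
first  <- first_group("(?s)tags:(.+)\\d", s)

# With nothing after (.+), the capture runs to the very end of the string
second <- first_group("(?s)tags:\n(.+)", s)

print(first)
print(second)
```

Both patterns now match, but each captures more than the tag list itself, including part or all of the like count.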

However, it seems you need to get all the strings that follow tags: as separate matches. I suggest using base R regmatches/gregexpr with a PCRE regex like

(?:\G(?!\A),?|tags:)\R\h*\K[^\s,]+ 

See the regex demo on your data.

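  • (?:\G(?!\A),?|tags:) - either the end of the previous successful match followed by 1 or 0 commas (\G(?!\A),?), or (|) the literal text tags:
  • \R - a line break sequence
  • \h* - 0+ horizontal whitespace characters
  • \K - the match reset operator, which discards all the text matched so far
  • [^\s,]+ - 1 or more characters other than whitespace and ,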
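To see what the \G(?!\A),?|tags: alternation contributes, the sketch below runs the pattern with and without it on a made-up string x (not the question's data). Without the anchor, every token that starts a line is matched; with it, matching starts at tags: and continues only while each match immediately follows the previous one.

```r
# Hypothetical sample with text before and after the tag list
x <- "intro\n  text,\ntags:\n  cry,\n  joy\n\n  12 likes"

# Helper (hypothetical name): all matches of a PCRE pattern
all_matches <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

# No anchor: any first token on a line qualifies, tags or not
no_anchor <- all_matches("\\R\\h*\\K[^\\s,]+", x)

# Anchored: start at "tags:", then chain from the end of the previous match
with_anchor <- all_matches("(?:\\G(?!\\A),?|tags:)\\R\\h*\\K[^\\s,]+", x)

print(no_anchor)
print(with_anchor)
```

The chain breaks at the blank line after the last tag, because \R consumes one line break and the next character is another line break rather than a tag character.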

R demo

str <- "\n\n\n \n\n\n  「Don't cry because it's over, smile because it happened.」\n ―\n Dr. Seuss\n\n\n\n\n \n  tags:\n  attributed-no-source,\n  cry,\n  crying,\n  experience,\n  happiness,\n  joy,\n  life,\n  misattributed-dr-seuss,\n  optimism,\n  sadness,\n  smile,\n  smiling\n \n \n  176513 likes\n \n\n\n\n\nLike\n\n" 
reg <- "(?:\\G(?!\\A),?|tags:)\\R\\h*\\K[^\\s,]+" 
vals <- regmatches(str, gregexpr(reg, str, perl=TRUE)) 
unlist(vals) 

Result:

 [1] "attributed-no-source"   "cry"                    "crying"
 [4] "experience"             "happiness"              "joy"
 [7] "life"                   "misattributed-dr-seuss" "optimism"
[10] "sadness"                "smile"                  "smiling"
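If the \G-based pattern feels hard to maintain, a two-step alternative may be easier to read. This is not part of the answer above, only a hedged sketch: it assumes the tag list is followed by whitespace and then a digit (the like count), and that tags never contain commas. The string s is a shortened, made-up version of the scraped text.

```r
# Shortened, hypothetical version of the scraped text
s <- "quote\n  tags:\n  cry,\n  joy,\n  smile\n \n  176513 likes\n"

# Step 1: keep only the text between "tags:" and the first run of
# whitespace that is followed by a digit (lazy .+? stops there)
block <- sub("(?s).*tags:(.+?)\\s+\\d.*", "\\1", s, perl = TRUE)

# Step 2: split on commas and strip the surrounding whitespace
tags <- trimws(strsplit(block, ",", fixed = TRUE)[[1]])

print(tags)
```

The trade-off is that sub() returns the input unchanged when the pattern does not match, so real input without a tags: section would need a separate check.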