Regex, R, and commas

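I'm having some trouble with regular expressions in R. I'm trying to use a regex to extract the tags from a string (scraped from the web) that looks like this: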

str <- "\n\n\n \n\n\n  「Don't cry because it's over, smile because it happened.」\n ―\n Dr. Seuss\n\n\n\n\n \n  tags:\n  attributed-no-source,\n  cry,\n  crying,\n  experience,\n  happiness,\n  joy,\n  life,\n  misattributed-dr-seuss,\n  optimism,\n  sadness,\n  smile,\n  smiling\n \n \n  176513 likes\n \n\n\n\n\nLike\n\n" 

# Why doesn't this work at all? 
stringr::str_match(str, "tags:(.+)\\d") 

    [,1] [,2] 
[1,] NA NA 

# Why just the first tag? What happens at the comma? 
stringr::str_match(str, "tags:\n(.+)") 

     [,1]         [,2]       
[1,] "tags:\n  attributed-no-source," "  attributed-no-source,"

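So, two questions: why doesn't my first attempt match at all, and why does the second one capture only up to the first comma instead of running through to the end of the string?

Thanks!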

Source

2017-06-29 Alex Gold

It would be helpful if you explained what your expected result is. – Dason

Did you mean 'str_match(str, "tags:[^0-9]*[0-9]*")' for the first case? – akrun

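Note that the stringr regex flavor is ICU. Unlike in TRE, . does not match line break characters in an ICU regex pattern. That explains both results: in the first pattern, tags: is immediately followed by a newline, so .+ cannot match even one character; in the second, (.+) stops at the end of the first tag's line, which happens to end in a comma.

So, a possible fix is to put (?s), the DOTALL modifier that makes . match any character including line breaks, at the start of your pattern: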

str_match(str, "(?s)tags:(.+)\\d")

and

str_match(str, "(?s)tags:\n(.+)")
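
The newline behaviour can be checked without stringr. This is a minimal sketch using base R only, on the assumption that PCRE (perl = TRUE) is a fair stand-in for ICU here, since in both flavors . skips line breaks unless (?s) is set. The short string x and the variable names are made up for illustration.

```r
# Hypothetical, shortened input with the same shape as the scraped string.
x <- "tags:\n  cry,\n  joy\n\n  12 likes\n"

# TRE (base R default): `.` also matches "\n", so the greedy `.+` spans lines
# and backtracks to the last digit.
tre <- regmatches(x, regexpr("tags:(.+)\\d", x))
# "tags:\n  cry,\n  joy\n\n  12"

# PCRE without DOTALL: "tags:" is followed by "\n", which `.` cannot match,
# so `.+` fails and there is no match at all.
pcre_plain <- regmatches(x, regexpr("tags:(.+)\\d", x, perl = TRUE))
# character(0)

# PCRE with (?s): `.` matches "\n" again, giving the same match as TRE.
pcre_dotall <- regmatches(x, regexpr("(?s)tags:(.+)\\d", x, perl = TRUE))

# The second pattern consumes the "\n" explicitly, then `.+` runs only to the
# end of that one line, which ends in a comma.
first_line <- regmatches(x, regexpr("tags:\n(.+)", x, perl = TRUE))
# "tags:\n  cry,"
```

Note that with (?s) the greedy .+ swallows everything up to the last digit in the string, so the capture still needs post-processing to get individual tags.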

However, it looks like you need to get all the strings that follow tags: as separate matches. I suggest using base R regmatches/gregexpr with a PCRE regex like

(?:\G(?!\A),?|tags:)\R\h*\K[^\s,]+

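See the regex demo with your data.

(?:\G(?!\A),?|tags:) - either the end of the previous successful match followed by 1 or 0 , characters (\G(?!\A),?) or (|) the tags: substring
\R - a line break sequence
\h* - 0+ horizontal whitespace characters
\K - the match reset operator, which discards all the text matched so far
[^\s,]+ - 1 or more characters other than whitespace and ,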
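
To see what the \G(?!\A) / tags: alternation contributes, here is a small sketch that runs the pattern with and without it. The input x and the variable names are invented for illustration; only the chained pattern comes from the answer.

```r
# Hypothetical input: text before "tags:", two tags, then a "likes" line.
x <- "intro,\n  skipped\n tags:\n  cry,\n  joy\n \n  12 likes\n"

# Pattern from the answer: a match must start at "tags:" or continue
# directly from the end of the previous match.
chained <- "(?:\\G(?!\\A),?|tags:)\\R\\h*\\K[^\\s,]+"

# Same pattern with the leading group removed: any token at the start of
# a line qualifies.
unchained <- "\\R\\h*\\K[^\\s,]+"

with_anchor <- unlist(regmatches(x, gregexpr(chained, x, perl = TRUE)))
# "cry" "joy"

without_anchor <- unlist(regmatches(x, gregexpr(unchained, x, perl = TRUE)))
# "skipped" "tags:" "cry" "joy" "12"
```

The chain breaks at the whitespace-only line after the last tag, which is why nothing from the likes line is returned by the anchored version.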

See the R demo:

str <- "\n\n\n \n\n\n  「Don't cry because it's over, smile because it happened.」\n ―\n Dr. Seuss\n\n\n\n\n \n  tags:\n  attributed-no-source,\n  cry,\n  crying,\n  experience,\n  happiness,\n  joy,\n  life,\n  misattributed-dr-seuss,\n  optimism,\n  sadness,\n  smile,\n  smiling\n \n \n  176513 likes\n \n\n\n\n\nLike\n\n" 
reg <- "(?:\\G(?!\\A),?|tags:)\\R\\h*\\K[^\\s,]+" 
vals <- regmatches(str, gregexpr(reg, str, perl=TRUE)) 
unlist(vals)

Result:

[1] "attributed-no-source" "cry" "crying" 
[4] "experience" "happiness" "joy" 
[7] "life" "misattributed-dr-seuss" "optimism" 
[10] "sadness" "smile" "smiling"
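
If the single chained regex feels hard to maintain, a two-step alternative may be easier to read. This is only a sketch of a different technique (grab the block, then split), not the approach above. It assumes the tag list is always followed by a blank or whitespace-only line, and x, block, and tags are illustrative names.

```r
# Hypothetical, shortened input with the same layout as the scraped string.
x <- "quote\n  tags:\n  cry,\n  joy,\n  smile\n \n  12 likes\n"

# Step 1: take everything after "tags:" up to the first line that is empty
# or whitespace-only. (?s) lets `.` cross line breaks; \K drops "tags:" and
# the whitespace after it from the reported match.
block <- regmatches(
  x,
  regexpr("(?s)tags:\\s*\\K.*?(?=\\n\\h*\\n)", x, perl = TRUE)
)

# Step 2: split on commas and trim surrounding whitespace and newlines.
# Assumption: `block` has exactly one element; if the pattern does not match,
# `block` is character(0) and the [[1]] below would fail.
tags <- trimws(strsplit(block, ",")[[1]])
# "cry" "joy" "smile"
```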

Source

2017-06-29 18:20:01
