Regex to extract text from a string in R

I am trying to use a regular expression to extract text from the following string in R:

string = "<td class=\"title\"><a href=\"/title/tt0075669/\">Amar Akbar Anthony</a><div class=\"desc_preview\" title=\"10/10&#10;votes 2\"> </div>\n</td>"

The code I am using:

library(stringr) 
str_extract(string,"[A-Z]\\w+")

This gives me the following result:

> str_extract(string,"[A-Z]\\w+") 
[1] "Amar"

But I want the string "Amar Akbar Anthony" as my output. How should I change my regex?

Source

2016-09-26 Rajarshi Bhadra

Add whitespace to the character class: "[A-Z][\\w\\s]+"

This is exactly what I wanted.

Note that your regex does not allow whitespace. Add it to the character class as [\\w\\s]:

"[A-Z][\\w\\s]+"

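For anyone without stringr installed, the same pattern can be tried in base R. This is only a minimal sketch: the variable names s and m are mine, and perl = TRUE is assumed so that \\w and \\s work as shorthand classes inside the bracket expression.

```r
# Sample string from the question
s <- "<td class=\"title\"><a href=\"/title/tt0075669/\">Amar Akbar Anthony</a><div class=\"desc_preview\" title=\"10/10&#10;votes 2\"> </div>\n</td>"

# regexpr() locates the first match; regmatches() extracts it
m <- regmatches(s, regexpr("[A-Z][\\w\\s]+", s, perl = TRUE))
print(m)
```

On this input it returns "Amar Akbar Anthony". It works here only because the first capital letter in the string happens to be the start of the title; markup containing an earlier capital letter would match there instead.
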
Alternatively, if your string always has the format above, you do not even need the stringr library; base R gsub is enough:

s <- "<td class=\"title\"><a href=\"/title/tt0075669/\">Amar Akbar Anthony</a><div class=\"desc_preview\" title=\"10/10&#10;votes 2\"> </div>\n</td>" 
trimws(gsub("<[^>]+>","",s)) 
[1] "Amar Akbar Anthony"

See this online demo. gsub("<[^>]+>","",s) removes all opening, closing, and other tags.

Or use an XML parsing library to grab the value of the a tag:

> library("XML") 
> s <- "<td class=\"title\"><a href=\"/title/tt0075669/\">Amar Akbar Anthony</a><div class=\"desc_preview\" title=\"10/10&#10;votes 2\"> </div>\n</td>" 
> parsed_doc = htmlParse(s, useInternalNodes = TRUE) 
> res <- getNodeSet(doc = parsed_doc, path = "//a/text()") 
> plain_text <- sapply(res, xmlValue) 
> plain_text 
[1] "Amar Akbar Anthony"

Source

2016-09-26 08:33:37

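Edit: Oops! I misread your question. The way I usually extract something from between two HTML tags is to use a positive lookbehind on ">" and then read everything up to the next "<".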

string = "<td class=\"title\"><a href=\"/title/tt0075669/\">Amar Akbar Anthony</a><div class=\"desc_preview\" title=\"10/10&#10;votes 2\"> </div>\n</td>" 

str_extract(string,"(?<=>)[^<]+")

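This is a bit fragile. The better answer is not to use regular expressions to parse HTML. (htmlTreeParse() from the XML library is one way; the httr package also has functions for this.)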
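To illustrate the fragility, here is a rough base R sketch that collects every match of the lookbehind pattern instead of just the first one. The names s, pieces, and cleaned are my own, and perl = TRUE is assumed because lookbehind needs the PCRE engine.

```r
# Sample string from the question
s <- "<td class=\"title\"><a href=\"/title/tt0075669/\">Amar Akbar Anthony</a><div class=\"desc_preview\" title=\"10/10&#10;votes 2\"> </div>\n</td>"

# Every run of non-"<" characters that directly follows a ">"
pieces <- regmatches(s, gregexpr("(?<=>)[^<]+", s, perl = TRUE))[[1]]

# Some pieces are whitespace between tags; trim them and drop the empty ones
cleaned <- trimws(pieces)
cleaned <- cleaned[nzchar(cleaned)]
print(cleaned)
```

Here pieces has three elements (the title, a single space, and a newline), so taking only the first match works by luck of the tag order. After trimming, cleaned holds just "Amar Akbar Anthony".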

My original answer, which extracts all the words as a list:

Switch from str_extract() to str_extract_all():

str_extract(string,"[A-Z]\\w+") 
[1] "Amar" 

str_extract_all(string,"[A-Z]\\w+") 
[[1]] 
[1] "Amar" "Akbar" "Anthony"
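If a single string is wanted rather than a list of words, the pieces could be glued back together with paste(). This is only a sketch under the assumption that every word of the title starts with a capital letter; the names words and title are mine.

```r
library(stringr)

# Sample string from the question
string <- "<td class=\"title\"><a href=\"/title/tt0075669/\">Amar Akbar Anthony</a><div class=\"desc_preview\" title=\"10/10&#10;votes 2\"> </div>\n</td>"

# str_extract_all() returns a list with one character vector per input string
words <- str_extract_all(string, "[A-Z]\\w+")[[1]]

# Join the capitalised words with single spaces
title <- paste(words, collapse = " ")
print(title)
```

For this input the result is "Amar Akbar Anthony". A title containing lowercase words (for example "of" or "the") would lose them, so the character-class or parser approaches above are safer in general.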

Source

2016-09-26 08:21:22
