正則表達式是可能的,但我更喜歡rvest
包本,
這是data.table或dplyr更容易,但讓這樣做它基礎R,(在關閉的機會,這些都是新的概念)
# Example data
df <- structure(list(name = c("John", "Steve"), html = c("<span class=\"incident-icon\" data-minute=\"68\" data-second=\"37\" data-id=\"8028\"></span><span class=\"name-meta-data\">68</span>",
"<span class=\"incident-icon\" data-minute=\"69\" data-second=\"4\" data-id=\"132205\"></span><span class=\"name-meta-data\">69</span>"
)), .Names = c("name", "html"), row.names = c(NA, -2L), class = "data.frame")
rvest讓我們使用DOM,可以比使用正則表達式的工作同樣的事情要好很多拆分這件事。
library(rvest)
# Get span attributes from each row:
spanattrs <-
lapply(df$html,
function(y) read_html(y) %>% html_node('span') %>% html_attrs)
# rbind to get a data.frame with all attributes
final <- data.frame(df, do.call(rbind,spanattrs))
> final
name html class
1 John <span class="incident-icon" data-minute="68" data-second="37" data-id="8028"></span><span class="name-meta-data">68</span> incident-icon
2 Steve <span class="incident-icon" data-minute="69" data-second="4" data-id="132205"></span><span class="name-meta-data">69</span> incident-icon
data.minute data.second data.id
1 68 37 8028
2 69 4 132205
讓我們刪除HTML,所以它在這裏的觀衆更好一點:
> final$html <- NULL
> final
name class data.minute data.second data.id
1 John incident-icon 68 37 8028
2 Steve incident-icon 69 4 132205