2016-12-04

How to clean and split HTML tags in R?

My parser creates a data frame that looks like this:

name   html 
1 John   <span class="incident-icon" data-minute="68" data-second="37" data-id="8028"></span><span class="name-meta-data">68</span> 
2 Steve   <span class="incident-icon" data-minute="69" data-second="4" data-id="132205"></span><span class="name-meta-data">69</span> 

So how can I extract the useful information from the HTML? For example, I would like to use some of the HTML attributes as features:

name minute second  id 
1 John  68  37 8028 
2 Steve  69  4 132205 

Answers

Regular expressions would work, but I prefer the rvest package for this.

This is easier with data.table or dplyr, but let's do it in base R (on the off chance that those are new concepts).

# Example data 

df <- structure(list(name = c("John", "Steve"), html = c("<span class=\"incident-icon\" data-minute=\"68\" data-second=\"37\" data-id=\"8028\"></span><span class=\"name-meta-data\">68</span>", 
"<span class=\"incident-icon\" data-minute=\"69\" data-second=\"4\" data-id=\"132205\"></span><span class=\"name-meta-data\">69</span>" 
)), .Names = c("name", "html"), row.names = c(NA, -2L), class = "data.frame") 

rvest lets us work with the DOM, which can be a much better way to split this up than doing the same job with regular expressions.

library(rvest) 

# Get span attributes from each row: 
spanattrs <- 
    lapply(df$html, 
      function(y) read_html(y) %>% html_node('span') %>% html_attrs) 

# rbind to get a data.frame with all attributes 
final <- data.frame(df, do.call(rbind,spanattrs)) 

> final 
    name                              html   class 
1 John <span class="incident-icon" data-minute="68" data-second="37" data-id="8028"></span><span class="name-meta-data">68</span> incident-icon 
2 Steve <span class="incident-icon" data-minute="69" data-second="4" data-id="132205"></span><span class="name-meta-data">69</span> incident-icon 
    data.minute data.second data.id 
1   68   37 8028 
2   69   4 132205 

Let's drop the html column so the output is a bit easier to read here:

> final$html <- NULL 
> final 
    name   class data.minute data.second data.id 
1 John incident-icon   68   37 8028 
2 Steve incident-icon   69   4 132205 
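
To get from here to the layout asked for in the question, the leftover class column can be dropped, the data. prefix stripped, and the values converted to numbers. A minimal base R sketch, assuming `final` looks like the output above (it is rebuilt by hand here so the snippet runs on its own; `tidy` and `num_cols` are just illustrative names):

```r
# Rebuild `final` as printed above so this snippet is self-contained
final <- data.frame(
  name = c("John", "Steve"),
  class = c("incident-icon", "incident-icon"),
  data.minute = c("68", "69"),
  data.second = c("37", "4"),
  data.id = c("8028", "132205"),
  stringsAsFactors = FALSE
)

tidy <- final
tidy$class <- NULL                                # same value in every row
names(tidy) <- sub("^data\\.", "", names(tidy))   # data.minute -> minute, etc.

# Attribute values arrive as text. Going through as.character() first
# guards against factor columns, which data.frame() creates by default
# in R versions before 4.0.
num_cols <- c("minute", "second", "id")
tidy[num_cols] <- lapply(tidy[num_cols], function(x) as.integer(as.character(x)))

str(tidy)
```

Whether integer is the right type for id depends on how the ids are used later; keeping them as character is equally reasonable.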

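If you already have the data frame from your question, you can try the following. Your data frame is called mydf here. You can use stri_extract_all_regex() to extract all the numbers. Then follow the classic approach for converting a list to a data frame. Finally, assign new column names and bind the result to the name column of the original data frame.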

library(stringi) 
library(dplyr) 

stri_extract_all_regex(str = mydf$url, pattern = "[0-9]+") %>% 
unlist %>% 
matrix(ncol = 4, byrow = TRUE) %>% 
data.frame %>% 
setNames(c("minute", "second", "ID", "data")) %>% 
bind_cols(mydf["name"], .) 

# name minute second  ID data 
#1 John  68  37 8028 68 
#2 Steve  69  4 132205 69 

DATA

mydf <- structure(list(name = c("John", "Steve"), url = c("<span class=\"incident-icon\" data-minute=\"68\" data-second=\"37\" data-id=\"8028\"></span><span class=\"name-meta-data\">68</span>", 
"<span class=\"incident-icon\" data-minute=\"69\" data-second=\"4\" data-id=\"132205\"></span><span class=\"name-meta-data\">69</span>" 
)), .Names = c("name", "url"), row.names = c(NA, -2L), class = "data.frame") 
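
Pulling out every run of digits works for this data, but it relies on the numbers always appearing in the same order and count. A hedged base R alternative is to match each attribute by name. The helper `extract_attr()` below is hypothetical (it is not part of any package) and assumes attribute values are wrapped in double quotes:

```r
# Same data as mydf above, rebuilt so the snippet is self-contained
mydf <- data.frame(
  name = c("John", "Steve"),
  url = c(
    '<span class="incident-icon" data-minute="68" data-second="37" data-id="8028"></span><span class="name-meta-data">68</span>',
    '<span class="incident-icon" data-minute="69" data-second="4" data-id="132205"></span><span class="name-meta-data">69</span>'
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the value of the first attr="..." found in
# each string, or NA when the attribute is missing
extract_attr <- function(html, attr) {
  pattern <- sprintf('%s="([^"]*)"', attr)
  matches <- regmatches(html, regexec(pattern, html))
  vapply(matches,
         function(m) if (length(m) >= 2) m[2] else NA_character_,
         character(1))
}

result <- data.frame(
  name   = mydf$name,
  minute = as.integer(extract_attr(mydf$url, "data-minute")),
  second = as.integer(extract_attr(mydf$url, "data-second")),
  id     = as.integer(extract_attr(mydf$url, "data-id")),
  stringsAsFactors = FALSE
)

print(result)
```

This is still a regular expression over HTML, so it shares the usual fragility; for anything less regular than this input, a DOM-based approach is the safer choice.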

An alternative rvest approach using purrr and dplyr:

library(rvest) 
library(purrr) 
library(purrrlyr) # by_row() was moved from purrr to purrrlyr in later releases 
library(dplyr) 

df <- read.table(stringsAsFactors=FALSE, header=TRUE, sep=",", text='name,html 
John,<span class="incident-icon" data-minute="68" data-second="37" data-id="8028"></span><span class="name-meta-data">68</span> 
Steve,<span class="incident-icon" data-minute="69" data-second="4" data-id="132205"></span><span class="name-meta-data">69</span>') 

by_row(df, .collate="cols", 
     ~read_html(.$html) %>% 
     html_nodes("span:first-of-type") %>% 
     html_attrs() %>% 
     flatten_chr() %>% 
     as.list() %>% 
     flatten_df()) %>% 
    select(-html, -class1) %>% 
    setNames(gsub("^data-|1$", "", colnames(.))) 
## # A tibble: 2 × 4 
## name minute second  id 
## <chr> <chr> <chr> <chr> 
## 1 John  68  37 8028 
## 2 Steve  69  4 132205