rvest HTML table scraping returns an empty list

I have had success scraping data from HTML tables with rvest, but for this particular website, http://www.sanzarrugby.com/superrugby/competition-stats/2016-team-ranking/, when I run the code

library(rvest)

url <- "http://www.sanzarrugby.com/superrugby/competition-stats/2016-team-ranking/"
rankings <- url %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table()

all that is returned is an empty list. What could be wrong?

Source

2016-05-12 Mark Einhorn

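The "problem" with this site is that it dynamically loads a JavaScript file and then executes it through a callback mechanism to create the JS data, from which the table/vis is built. rvest only sees the HTML the server sends, which contains no table yet, so html_nodes("table") matches nothing.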
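To make the empty result concrete, here is a minimal sketch that runs offline. The HTML string is a made-up stand-in for what the server sends (an empty container plus a script tag); the id and script URL are assumptions, not the real page's markup.

```r
library(rvest)

# Hypothetical stand-in for the static HTML: an empty container plus a script
# tag. The id and the script URL are illustrative assumptions only.
static_html <- '<html><body>
  <div id="stats-container"></div>
  <script src="http://example.com/feed.js"></script>
</body></html>'

page <- read_html(static_html)

# No <table> exists until a browser runs the script, so the nodeset is empty
tables <- html_nodes(page, "table")
length(tables)

# html_table() on an empty nodeset gives the empty list seen in the question
html_table(tables)

# Listing the script sources hints at where the data is really loaded from
script_srcs <- html_attr(html_nodes(page, "script"), "src")
script_srcs
```

Running the same two checks against the real page is a quick way to tell a JavaScript-built table from a selector mistake.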

One way to get the data is [R]Selenium, but that is problematic for many people.

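Another way is to use the browser's developer tools to find the JS request, run "Copy as cURL" on it (usually via right-click), and then use some R-fu to get what you need. Since the request returns JavaScript rather than plain JSON, we need to do some munging before finally converting the JSON.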
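Before the full request code, the munging step can be sketched in isolation. The payload below is made up (only the callback name comes from the URL in this answer), and strip_jsonp is a hypothetical helper written for this illustration, not part of jsonlite or httr.

```r
library(jsonlite)

# Made-up JSONP payload; the field names and values are illustrative only
jsonp <- 'RU3_205_2016({"teams":[{"name":"A","points":10},{"name":"B","points":7}]})'

# Hypothetical helper: drop everything up to the first "(" and the trailing
# ")" (optionally followed by ";"), leaving only the JSON text
strip_jsonp <- function(x) {
  x <- sub("^[^(]*\\(", "", x)
  sub("\\)[[:space:]]*;?[[:space:]]*$", "", x)
}

dat <- fromJSON(strip_jsonp(jsonp))
dat$teams
```

Matching any callback name rather than a hard-coded one is a design choice for this sketch; the code that follows strips the specific RU3_205_2016 wrapper instead.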

library(jsonlite) 
library(curlconverter) 
library(httr) 

# this is the `Copy as cURL` result, but you can leave it in your clipboard 
# and not do this in production. Read the `curlconverter` help for more info 

CURL <- "curl 'http://omo.akamai.opta.net/competition.php?feed_type=ru3&competition=205&season_id=2016&user=USERNAME&psw=PASSWORD&jsoncallback=RU3_205_2016' -H 'DNT: 1' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36 Vivaldi/1.1.453.54' -H 'Accept: */*' -H 'Referer: http://www.sanzarrugby.com/superrugby/competition-stats/2016-team-ranking/' -H 'Connection: keep-alive' -H 'If-Modified-Since: Wed, 11 May 2016 14:47:09 GMT' -H 'Cache-Control: max-age=0' --compressed" 

req <- make_req(straighten(CURL))[[1]] 
req 

# that makes: 

# httr::VERB(verb = "GET", url = "http://omo.akamai.opta.net/competition.php?feed_type=ru3&competition=205&season_id=2016&user=USERNAME&psw=PASSWORD&jsoncallback=RU3_205_2016", 
#  httr::add_headers(DNT = "1", `Accept-Encoding` = "gzip, deflate, sdch", 
#   `Accept-Language` = "en-US,en;q=0.8", `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36 Vivaldi/1.1.453.54", 
#   Accept = "*/*", Referer = "http://www.sanzarrugby.com/superrugby/competition-stats/2016-team-ranking/", 
#   Connection = "keep-alive", `If-Modified-Since` = "Wed, 11 May 2016 14:47:09 GMT", 
#   `Cache-Control` = "max-age=0")) 

# which we can transform into the following after experimenting 

URL <- "http://omo.akamai.opta.net/competition.php?feed_type=ru3&competition=205&season_id=2016&user=USERNAME&psw=PASSWORD&jsoncallback=RU3_205_2016" 

pg <- GET(URL, 
      add_headers(
      `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36 Vivaldi/1.1.453.54", 
      Referer = "http://www.sanzarrugby.com/superrugby/competition-stats/2016-team-ranking/")) 

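# now all we need to do is remove the callback wrapper, RU3_205_2016( ... ),
# so that only the JSON payload is left

txt <- content(pg, as="text", encoding="UTF-8")
txt <- gsub("^RU3_205_2016\\(", "", txt)   # leading callback name and "("
txt <- gsub("\\)$", "", txt)               # trailing ")"
dat_from_json <- fromJSON(txt, flatten=FALSE)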


# we can also try removing the JSON callback, but it will return XML instead of JSON, 
# which is fine since we can parse that easily 

URL <- "http://omo.akamai.opta.net/competition.php?feed_type=ru3&competition=205&season_id=2016&user=USERNAME&psw=PASSWORD" 

pg <- GET(URL, 
      add_headers(
      `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36 Vivaldi/1.1.453.54", 
      Referer = "http://www.sanzarrugby.com/superrugby/competition-stats/2016-team-ranking/")) 

xml_doc <- content(pg, as="parsed", encoding="UTF-8") 

# but then you have to transform the XML, which I'll leave as an exercise to the OP :-)
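For readers who want a starting point for that exercise, here is a rough sketch using xml2. The answer does not show the feed's real structure, so the XML below and the names team, name and points are placeholders, not the actual schema.

```r
library(xml2)

# Hypothetical XML; the element and attribute names are placeholders because
# the real feed's schema is not shown in the answer
xml_txt <- '<competition>
  <team name="A" points="10"/>
  <team name="B" points="7"/>
</competition>'

doc <- read_xml(xml_txt)

# Select every <team> node, then pull its attributes into columns
teams <- xml_find_all(doc, "//team")

rankings <- data.frame(
  name = xml_attr(teams, "name"),
  points = as.integer(xml_attr(teams, "points")),
  stringsAsFactors = FALSE
)
rankings
```

With the real response, xml_doc from the code above would likely take the place of doc, with the XPath and attribute names adjusted to whatever the feed actually contains.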

Source

2016-05-12 14:45:00 hrbrmstr

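Thank you so much @hrbrmstr, this is awesome! Works like a charm! I also managed to parse the XML successfully, which is exactly what I needed. Thanks again.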
