2017-09-27 90 views
require(httr) 
require(XML) 
basePage <- "http://bet.hkjc.com/" 
h <- handle(basePage) 
GET(handle = h) 
res <- GET(handle = h, path = "racing/pages/odds_wp.aspx?date=27-09-2017&venue=HV&raceno=2") 
resXML <- htmlParse(content(res, as = "text")) 

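I am using the code above to scrape an .aspx website. It returns a large block of text, but I only want the values of "var infoDivideByRace" and "var scratchList". How can I extract these two variables and convert them into column data? Thanks! Part of the returned text is shown below: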

var poolSellStatus = '[email protected]@@@@@;WIN;PLA;W-P;QIN;QPL;QQP;TRI;DBL;TCE;F-F;QTT;CWA;'.split('@@@'); 
var poolSellStatus_bak = '[email protected]@@@@@;WIN;PLA;W-P;QIN;QPL;QQP;TRI;DBL;TCE;F-F;QTT;CWA;'.split('@@@'); 
var winOddsByRace = '[email protected]@@@@@WIN;1=3.6=1;2=4.7=0;3=43=0;4=11=0;5=29=0;6=9.4=0;7=4.6=0;8=11=0;9=52=0;10=82=0;11=52=0;12=8.6=0#PLA;1=1.4=1;2=2.0=0;3=6.0=0;4=3.5=0;5=6.2=0;6=2.6=0;7=2.0=0;8=4.2=0;9=7.9=0;10=11=0;11=8.4=0;12=2.5=0'.split('@@@'); 
var multiRacePoolsStr = '@@@DBL#;1,2;2,3;3,4;4,5;5,6;6,7;7,[email protected]@@TBL#;6,7,[email protected]@@D-T#;3,4;6,[email protected]@@T-T#;4,5,[email protected]@@6UP#;3,4,5,6,7,8'; 
var fieldSize = 'HV;12;12;12;12;12;12;12;12'; 
var fieldSizeWithReserve = 'HV;12;12;12;12;12;12;12;12'; 
var reserveList = 'HV'; 
var scratchList = 'HV'; 

Answers

The simplest, or most appropriate, approach is to use PhantomJS or Selenium. Failing that, a regex combined with rvest works as a workaround.

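library(rvest)
library(stringr)  # provides str_extract() and str_replace_all()

basePage <- "http://bet.hkjc.com/"
path <- "racing/pages/odds_wp.aspx?date=27-09-2017&venue=HV&raceno=2"

# Build the full URL (path must be defined before it is used)
ss <- paste0(basePage, path)

# Collect the text of every <script> node on the page
scripts <- read_html(ss, encoding = 'utf8') %>%
    html_nodes("script") %>% html_text(trim = TRUE)

# Keep only the script(s) that define the target variables
new <- scripts[grepl('var scratchList =|var infoDivideByRace = ', scripts)]

# Extract each assignment, take the 4th space-separated token (the value),
# then strip the quotes and the trailing semicolon
value1 <- str_replace_all(strsplit(str_extract(new, regex('var scratchList = (.*?);')), split = ' ')[[1]][4], ";|'", '')
value2 <- str_replace_all(strsplit(str_extract(new, regex('var infoDivideByRace = (.*?);')), split = ' ')[[1]][4], ";|'", '')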

value1 
#[1] "HV" 

value2 
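
For the "column data" part of the question, here is a minimal base R sketch that works on script lines already in hand, with no network access. The sample lines are copied from the excerpt in the question. The helper name extract_js_var is hypothetical, and reading the fieldSize string as a venue code followed by one value per race is an assumption, not something the site documents.

```r
# Sample script lines, copied from the excerpt in the question.
script_txt <- c(
  "var fieldSize = 'HV;12;12;12;12;12;12;12;12';",
  "var reserveList = 'HV';",
  "var scratchList = 'HV';"
)

# Hypothetical helper: return the single-quoted value assigned to a JavaScript var,
# or NA if no line assigns that variable.
extract_js_var <- function(lines, name) {
  pattern <- paste0("^\\s*var\\s+", name, "\\s*=\\s*'([^']*)'.*$")
  hit <- grep(pattern, lines, value = TRUE, perl = TRUE)
  if (length(hit) == 0) return(NA_character_)
  sub(pattern, "\\1", hit[1], perl = TRUE)
}

scratch_list <- extract_js_var(script_txt, "scratchList")
field_size   <- extract_js_var(script_txt, "fieldSize")

# Assumption: the first ";"-separated field is the venue, the rest are per-race values.
parts <- strsplit(field_size, ";", fixed = TRUE)[[1]]
field_df <- data.frame(
  venue      = parts[1],
  race       = seq_along(parts[-1]),
  field_size = as.integer(parts[-1]),
  stringsAsFactors = FALSE
)
```

The same helper could be pointed at infoDivideByRace once its line is available; how to split that value into columns depends on its delimiters, which the excerpt does not show.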
Using the V8 package

An alternative option:

library(rvest) 
library(stringi) 
library(purrr) 
library(V8) 

Fetch the page you specified, which contains your target variables:

pg <- read_html("http://bet.hkjc.com/racing/pages/odds_wp.aspx?date=27-09-2017&venue=HV&raceno=2", encoding = "UTF-8") 

Extract the script tag, convert it to text, split it into a character vector of lines, and keep only the lines that start with var:

html_nodes(pg, xpath=".//script[contains(., 'infoDivideByRace')]") %>% 
    html_text() %>% 
    stri_split_lines() %>% 
    flatten_chr() %>% 
    keep(stri_detect_regex, "^var") -> script_txt 

Initialize the V8 JavaScript engine:

ctx <- v8() 

Have it parse the JavaScript and create the data:

ctx$eval(script_txt) 

Retrieve the data (infoDivideByRace has 2 blank array elements, so we drop them):

grep("^$", ctx$get('infoDivideByRace'), value=TRUE, invert=TRUE) 
## [1] STACKOVERFLOW'S SPAM PROTECTION WON'T LET ME PASTE THIS CONTENT 

ctx$get('scratchList') 
## [1] "HV"
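
ctx$get() hands back plain character vectors, so turning them into columns is a matter of splitting on the delimiters. As a rough sketch, the snippet below reshapes a string in the style of the winOddsByRace element from the question (shortened here) into a data frame. The function name parse_pool and the column names horse, odds, and flag are assumptions made for illustration; the excerpt does not say what the three "="-separated fields mean, and the layout of infoDivideByRace may differ.

```r
# Sample string in the shape of a winOddsByRace element from the question (shortened).
odds_str <- "WIN;1=3.6=1;2=4.7=0;3=43=0#PLA;1=1.4=1;2=2.0=0;3=6.0=0"

# Hypothetical helper: turn one "POOL;a=b=c;a=b=c;..." chunk into a data frame.
parse_pool <- function(pool_str) {
  fields  <- strsplit(pool_str, ";", fixed = TRUE)[[1]]
  entries <- strsplit(fields[-1], "=", fixed = TRUE)
  data.frame(
    pool  = fields[1],
    horse = as.integer(vapply(entries, `[`, character(1), 1)),  # assumed meaning
    odds  = as.numeric(vapply(entries, `[`, character(1), 2)),  # assumed meaning
    flag  = as.integer(vapply(entries, `[`, character(1), 3)),  # meaning not stated
    stringsAsFactors = FALSE
  )
}

# Pools are separated by "#"; parse each one and stack the rows.
pools   <- strsplit(odds_str, "#", fixed = TRUE)[[1]]
odds_df <- do.call(rbind, lapply(pools, parse_pool))
```

Keeping one row per pool and entry makes later filtering or reshaping straightforward, for example subsetting with odds_df$pool == "WIN".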

The above does not work... It returns: Error in flatten_chr(.) : could not find function "flatten_chr" –

I forgot 'library(purrr)' (I have now added it to the post) – hrbrmstr