2014-02-28 121 views
1

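Scraping a table from an ASP.NET web page with buttons using R

I have pored over related questions to no avail. I need to scrape a table of price information from an ASP.NET web page (http://www.spp.org/LIP.asp) based on a date and hour that I specify. I am comfortable with R and would like to use it. My basic stumbling block is that the URL is static and does not reflect the search parameters, and I don't know how to submit an HTML form on an ASP.NET page that relies on JavaScript.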

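I looked at the source of the URL above and found that an iframe links to a separate 'source data' page: http://www.spp.org/LIPPosting/LIP.aspx. I tried making a POST request from R, based on this StackOverflow thread: What if I want to web scrape with R for a page with parameters?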

## ASP.NET site scrape
library(RHTMLForms)  # getHTMLFormDescription(), createFunction()
library(XML)         # htmlParse(), getNodeSet(), readHTMLTable()

forms <- getHTMLFormDescription("http://www.spp.org/LIPPosting/LIP.aspx")
# Name the list for easy reference
names(forms) <- 'spp'
# Use the createFunction tool so I can submit a search
fun <- createFunction(forms$spp, verbose = TRUE)
# Submit an HTML form looking for data using all form defaults
# Except change the hour to '03'
results <- fun(ddlHour = '03')
# Grab the table results from the HTML based on its id tag
tableData <- getNodeSet(htmlParse(results), "//*/table[@id = 'dgLIP']")
readHTMLTable(tableData[[1]])

The returned HTML shows that '03' was indeed selected in the 'Hour' form element.

  <td style="height: 42px; width: 77px;"> 
<span id="lblLIPHour">Hour</span><br><select name="ddlHour" id="ddlHour"><option value="1">01</option> 
<option value="2">02</option> 
<option selected value="3">03</option> 
<option value="4">04</option> 
<option value="5">05</option> 
<option value="6">06</option> 
<option value="7">07</option> 
<option value="8">08</option> 

However, the request does not seem to reach the server: the actual results table is for the current hour, not '03'.

> readHTMLTable(tableData[[1]]) 
    Publish Date Price Date    PNode Price  Parent PNode Settlement Location 
1 201402281552 201402281600     AECI 23.45    AECI    AECI 
2 201402281552 201402281600     AMRN 23.45    AMRN    AMRN 
3 201402281552 201402281600     BLKW 23.45    BLKW    BLKW 
4 201402281552 201402281600     CLEC 23.45    CLEC    CLEC 
5 201402281552 201402281600   CSWS_AECC_LA 23.45  CSWS_AECC_LA   AECC_CSWS 
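One possible explanation, offered as a guess rather than a diagnosis: ASP.NET Web Forms pages usually expect the hidden state fields (such as `__VIEWSTATE` and `__EVENTVALIDATION`) and the name of the clicked button to come back with the POST, and may otherwise re-render the defaults. The sketch below collects those fields from a form using the XML package. The HTML string is a hypothetical stand-in, and the field names `__VIEWSTATE`, `__EVENTVALIDATION`, and `cmdView` (as a `name` rather than an `id`) are assumptions that have not been checked against the real page.

```r
library(XML)

# Hypothetical, trimmed-down stand-in for the form on the real page;
# the real hidden values would be long opaque strings.
html <- '<form id="form1" method="post" action="LIP.aspx">
<input type="hidden" name="__VIEWSTATE" value="abc123" />
<input type="hidden" name="__EVENTVALIDATION" value="def456" />
<select name="ddlHour"><option value="3">03</option></select>
<input type="submit" name="cmdView" value="View" />
</form>'

doc <- htmlParse(html, asText = TRUE)

# Collect every hidden input as a named character vector (name -> value)
hidden <- getNodeSet(doc, "//input[@type='hidden']")
hidden_fields <- sapply(hidden, xmlGetAttr, "value")
names(hidden_fields) <- sapply(hidden, xmlGetAttr, "name")

# Add the chosen hour (option value, not label) and the submit button itself
post_body <- c(as.list(hidden_fields), list(ddlHour = "3", cmdView = "View"))
str(post_body)

# Untested against this site: the list could then be sent with, for example,
# httr::POST("http://www.spp.org/LIPPosting/LIP.aspx", body = post_body, encode = "form")
```

The hidden values change on every response, so they would need to be re-read from the most recent page before each request.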

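Additionally, I only get the HTML of the single page the server returns, which does not contain all of the results. The page has JavaScript arrow buttons at the bottom that page through the full result set.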

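On the web page itself, I have to click the 'View' button to see results after choosing from the drop-down menus. Is there a way to replicate this in R, so that my '03' parameter is sent to the server as a query and new HTML is returned?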

If I can do that, I can write something to 'push' the page arrows.
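For the arrows specifically: links of the form `javascript:__doPostBack(target, argument)` normally work by filling the hidden fields `__EVENTTARGET` and `__EVENTARGUMENT` and submitting the form. A minimal base-R sketch of pulling those two values out of such an href is below. The href is the one the results page uses for its 'next' arrow; the variable names are only illustrative. Some ASP.NET versions rewrite the target inside the page's own `__doPostBack` script (for example replacing `$` with `:`), so the real script should be checked before relying on these values.

```r
# href of the 'next page' arrow on the results page
href <- "javascript:__doPostBack('dgLIP$_ctl29$_ctl1','')"

# Capture the two quoted arguments of the __doPostBack(...) call
pattern <- "^javascript:__doPostBack\\('([^']*)','([^']*)'\\)$"
stopifnot(grepl(pattern, href))

event_target   <- sub(pattern, "\\1", href)
event_argument <- sub(pattern, "\\2", href)

# Extra form fields a hand-built POST would need in order to mimic the click
pager_fields <- list(
  `__EVENTTARGET`   = event_target,
  `__EVENTARGUMENT` = event_argument
)
str(pager_fields)
```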


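I hope someone else gives you a more optimistic answer, but my advice is: don't do it. Using Python with the Selenium driver will be much easier, even if you don't already know Python. I say this as someone who loves R and tries to use it for everything, but in this case I don't think it is the right tool for the job. – Ista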


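Thanks Ista... I had never heard of Selenium before getting into this little pickle. Do you think there is an advantage to using the Python driver over the R package suggested by jdharrison? – sclarky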

Answers

2

You can use Selenium. See http://johndharrison.github.io/RSelenium/. Disclaimer: I am the author of the RSelenium package. Basic vignettes on how it operates can be viewed at RSelenium basics and RSelenium: Testing Shiny apps.

library(RSelenium)
library(XML)  # htmlParse(), readHTMLTable()
# RSelenium::startServer() # if needed
remDr <- remoteDriver()
remDr$open()
remDr$setImplicitWaitTimeout(3000)
remDr$navigate("http://www.spp.org/LIP.asp")
remDr$switchToFrame("content_frame")  # the form lives inside this iframe
dateElem <- remDr$findElement(using = "id", "txtLIPDate") # select the date
dateRequired <- "01/14/2014"
dateElem$clearElement()
dateElem$sendKeysToElement(list(dateRequired, key = "enter")) # send a date to app
hourElem <- remDr$findElement(using = "css selector", '#ddlHour [value="5"]') # select the 5th hour
hourElem$clickElement() # select this hour
buttonElem <- remDr$findElement(using = "id", "cmdView")
buttonElem$clickElement() # click the view button

#Sys.sleep(5)
tableElem <- remDr$findElement(using = "id", "dgLIP")
# Parse the table's HTML into a data frame
readHTMLTable(htmlParse(tableElem$getElementAttribute("outerHTML")[[1]]))

[1] "tableElem$getElementAttribute(\"outerHTML\")" 
$dgLIP 
V1   V2     V3 V4     V5     V6 
1 Publish Date Price Date    PNode Price  Parent PNode Settlement Location 
2 201401132252 201401132300     AECI 19.14    AECI    AECI 
3 201401132252 201401132300     AMRN 18.87    AMRN    AMRN 
4 201401132252 201401132300     BLKW 20.28    BLKW    BLKW 
5 201401132252 201401132300     CLEC 18.99    CLEC    CLEC 
6 201401132252 201401132300   CSWS_AECC_LA 19.77  CSWS_AECC_LA   AECC_CSWS 
7 201401132252 201401132300 CSWS_GREEN_LIGHT_LA 18.5 CSWS_GREEN_LIGHT_LA  GSEC_GL_CSWS 
8 201401132252 201401132300    CSWS_LA 19.01    CSWS_LA   AEPM_CSWS 
9 201401132252 201401132300    CSWS_LA 19.01    CSWS_LA   AEP_LOSS 
10 201401132252 201401132300   CSWS_OMPA_LA 18.66  CSWS_OMPA_LA   OMPA_CSWS 
11 201401132252 201401132300  CSWS_TENASKA_LA 18.95  CSWS_TENASKA_LA  GATEWAY_LOAD 
12 201401132252 201401132300  CSWS112_WGORLD1 18.7    CSWS_LA   AEPM_CSWS 
13 201401132252 201401132300  CSWS112_WGORLD1 18.7    CSWS_LA   AEP_LOSS 
14 201401132252 201401132300  CSWS116PEORILD1 18.9    CSWS_LA   AEPM_CSWS 
15 201401132252 201401132300  CSWS116PEORILD1 18.9    CSWS_LA   AEP_LOSS 
16 201401132252 201401132300 CSWS121EASTLDXFL1 18.92    CSWS_LA   AEPM_CSWS 
17 201401132252 201401132300 CSWS121EASTLDXFL1 18.92    CSWS_LA   AEP_LOSS 
18 201401132252 201401132300  CSWS121LYNN4LD1 18.91    CSWS_LA   AEPM_CSWS 
19 201401132252 201401132300  CSWS121LYNN4LD1 18.91    CSWS_LA   AEP_LOSS 
20 201401132252 201401132300 CSWS12TH_STLD69_12 18.92    CSWS_LA   AEPM_CSWS 
21 201401132252 201401132300 CSWS12TH_STLD69_12 18.92    CSWS_LA   AEP_LOSS 
22 201401132252 201401132300 CSWS12TH_STLD69_12_2 18.92    CSWS_LA   AEPM_CSWS 
23 201401132252 201401132300 CSWS12TH_STLD69_12_2 18.92    CSWS_LA   AEP_LOSS 
24 201401132252 201401132300  CSWS136_YALELD1 18.9    CSWS_LA   AEPM_CSWS 
25 201401132252 201401132300  CSWS136_YALELD1 18.9    CSWS_LA   AEP_LOSS 
26 201401132252 201401132300 CSWS141_PINELDXFMR1 19.09    CSWS_LA   AEPM_CSWS 
27   < >   <NA>     <NA> <NA>    <NA>    <NA> 
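In the output above, the column headings arrive as row 1, every column is text, and the last row holds the pager links rather than data. A minimal base-R sketch for tidying one such table follows. The `raw` data frame is a hypothetical four-column stand-in for what `readHTMLTable()` returns, `clean_lip_table` is an illustrative helper name, and dropping rows without a numeric Price is an assumption about how the pager row can be recognised.

```r
# Hypothetical stand-in for one scraped page: header text in row 1,
# pager links in the last row, all columns stored as text.
raw <- data.frame(
  V1 = c("Publish Date", "201401132252", "201401132252", "< >"),
  V2 = c("Price Date", "201401132300", "201401132300", NA),
  V3 = c("PNode", "AECI", "AMRN", NA),
  V4 = c("Price", "19.14", "18.87", NA),
  stringsAsFactors = FALSE
)

clean_lip_table <- function(df) {
  df[] <- lapply(df, as.character)             # make sure no column is a factor
  header <- unlist(df[1, ], use.names = FALSE)  # row 1 holds the real column names
  body <- df[-1, , drop = FALSE]
  names(body) <- header
  body$Price <- as.numeric(body$Price)          # text -> numeric; NA stays NA
  body <- body[!is.na(body$Price), , drop = FALSE]  # drops the pager row
  rownames(body) <- NULL
  body
}

prices <- clean_lip_table(raw)
print(prices)
```

Publish Date and Price Date are left as text here; converting them to date-times would depend on the time zone the site reports in.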

OK, I'm intrigued! I'm going to give it a go on Monday. – sclarky


I'm stuck at `remDr$open()`, which fails with `Error in function (type, msg, asError = TRUE) : couldn't connect to host`. I installed the package in R from GitHub using devtools. – sclarky


@sclarky You need to have the Selenium server running. See the RSelenium basics vignette. – jdharrison

0

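For posterity, I thought I would also share the page-clicking code I used for the results pages (there is no 'show all' option). I have RSelenium click through all the pages until there is no longer a 'forward' arrow to click, scraping the HTML table on each page into a list: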

# Get the first page of results
tableElem <- remDr$findElement(using = "id", "dgLIP")
tmp <- readHTMLTable(htmlParse(tableElem$getElementAttribute("outerHTML")[[1]]))
hourlyData <- list()
# Save the first table without the last row (row 27), which holds the pager links
hourlyData[[1]] <- tmp[[1]][-27,]

# href of the 'greater than' arrow that advances to the next page
nextHref <- "javascript:__doPostBack('dgLIP$_ctl29$_ctl1','')"
# Collect every element that has an href, together with the href values
getLinks <- function() {
    elems <- remDr$findElements("css selector", "[href]")
    list(elems = elems, hrefs = unlist(lapply(elems, function(x){x$getElementAttribute("href")})))
}

acc <- 2
links <- getLinks()
# Keep clicking until the 'next page' arrow is no longer present
while(nextHref %in% links$hrefs) {
    pager <- links$elems[[which(links$hrefs == nextHref)]]
    pager$clickElement()
    Sys.sleep(3)  # give the postback time to load the next page before scraping
    tableElem <- remDr$findElement(using = "id", "dgLIP")
    tmp <- readHTMLTable(htmlParse(tableElem$getElementAttribute("outerHTML")[[1]]))
    hourlyData[[acc]] <- tmp[[1]]  # note: the pager row is not dropped on these pages
    acc <- acc + 1
    links <- getLinks()
}
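Once the loop finishes, `hourlyData` is a list with one data frame per page. A minimal base-R sketch for stacking them into a single table is below. The two small data frames are hypothetical stand-ins for scraped pages, and the rules used to spot repeated header rows and pager rows are assumptions based on the output shown earlier.

```r
# Hypothetical stand-in for hourlyData: header text in row 1 of every page,
# and a pager row (no price) at the end of the second page.
hourlyData <- list(
  data.frame(V1 = c("PNode", "AECI", "AMRN"), V2 = c("Price", "19.14", "18.87"),
             stringsAsFactors = FALSE),
  data.frame(V1 = c("PNode", "BLKW", "< >"), V2 = c("Price", "20.28", NA),
             stringsAsFactors = FALSE)
)

# Tag each page with its index so rows can be traced back, then stack them
tagged <- Map(function(df, i) { df$page <- i; df }, hourlyData, seq_along(hourlyData))
allPages <- do.call(rbind, tagged)

# Drop the header row repeated on every page and any pager rows
is_header <- allPages$V1 == "PNode"
is_pager  <- is.na(allPages$V2)
allPages  <- allPages[!(is_header | is_pager), ]
rownames(allPages) <- NULL
print(allPages)
```

`do.call(rbind, ...)` needs every page to have the same column names, which holds here because each page comes back with the default V1, V2, ... headings.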