Scraping an aspx website with R

I am trying to scrape data from a website using R.

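I would like to go through all the links on the following page: http://capitol.hawaii.gov/advreports/advreport.aspx?year=2013&report=deadline&rpt_type=&measuretype=hb&title=House Bills
and select only the items whose Current Status shows "Transmitted to Governor". For example, http://capitol.hawaii.gov/measure_indiv.aspx?billtype=HB&billnumber=17&year=2013
Then I want to scrape the cells in STATUS TEXT that contain the clause "Passed Final Reading". For example: Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0).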

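I have tried adapting earlier examples that use the packages RCurl and XML (in R), but I do not know how to use them correctly with aspx sites. So what I would like is: 1. Some suggestions on how to build such code. 2. Recommendations on how to learn what is needed to perform this kind of task.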

Thanks for your help,

Tom

2013-05-30 user2300643

I suggest you look through this thread, where I was trying to understand how to scrape a website. http://www.talkstats.com/showthread.php/26153-Still-trying-to-learn-to-scrape?highlight=still+learning+to+scrape –

I spent a few hours on this and it is not easy :( You can get the contents of the first page, but the second one does not accept the '__VIEWSTATE' and some other parameters I pass [as shown here](http://stackoverflow.com/questions/15853204/how-). I can get as far as `resp <- GET("http://capitol.hawaii.gov/advreports/advreport.aspx?year=2013&report=deadline&rpt_type=&measuretype=hb&title=House Bills"); writeBin(content(resp, 'raw'), tf); readHTMLTable(tf)$GridViewReports`, but the second site kills it :( –

require(httr) 
require(XML) 

basePage <- "http://capitol.hawaii.gov" 

h <- handle(basePage) 

GET(handle = h) 

res <- GET(handle = h, path = "/advreports/advreport.aspx?year=2013&report=deadline&rpt_type=&measuretype=hb&title=House") 

# parse content for "Transmitted to Governor" text 
resXML <- htmlParse(content(res, as = "text")) 
resTable <- getNodeSet(resXML, '//*/table[@id ="GridViewReports"]/tr/td[3]') 
appRows <-sapply(resTable, xmlValue) 
include <- grepl("Transmitted to Governor", appRows) 
resUrls <- xpathSApply(resXML, '//*/table[@id ="GridViewReports"]/tr/td[2]//@href') 

appUrls <- resUrls[include] 

# look at just the first 

res <- GET(handle = h, path = appUrls[1]) 

resXML <- htmlParse(content(res, as = "text")) 


xpathSApply(resXML, '//*[text()[contains(.,"Passed Final Reading")]]', xmlValue) 

[1] "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, 
Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, 
Tokioka voting no (4) and none excused (0)."

Let the httr package take care of all the background work by setting up a handle.

If you want to run this over all 92 links:

# get all the links returned as a list (will take some time) 
# print statement included for sanity 
res <- lapply(appUrls, function(x){print(sprintf("Got url no. %d",which(appUrls%in%x))); 
            GET(handle = h, path = x)}) 
resXML <- lapply(res, function(x){htmlParse(content(x, as = "text"))}) 
appString <- sapply(resXML, function(x){ 
        xpathSApply(x, '//*[text()[contains(.,"Passed Final Reading")]]', xmlValue) 
         }) 


head(appString) 

> head(appString) 
$href 
[1] "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0)." 

$href 
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none."             
[2] "Passed Final Reading as amended in CD 1 with Representative(s) Cullen, Har voting aye with reservations; Representative(s) McDermott voting no (1) and none excused (0)." 

$href 
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none."         
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; Representative(s) Hashem, McDermott voting no (2) and none excused (0)." 

$href 
[1] "Passed Final Reading, as amended (CD 1). 24 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 1 Excused: Ige."      
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; none voting no (0) and Representative(s) Say excused (1)." 

$href 
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none."       
[2] "Passed Final Reading as amended in CD 1 with Representative(s) Johanson voting aye with reservations; none voting no (0) and none excused (0)." 

$href 
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none." 
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; none voting no (0) and none excused (0)."
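Once the strings are collected, the vote counts in the House-style sentences can be pulled out with a regular expression. The following is only a minimal sketch in base R: the helper name `countAfter` is hypothetical (not part of httr or XML), the two sample strings are copied from the output above, and it assumes the counts always appear in parentheses directly after "voting no" and "excused". The Senate-style lines ("25 Aye(s); ...") use a different layout and would come back as NA.

```r
# Two House-style strings copied from the output above
votes <- c(
  "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0).",
  "Passed Final Reading as amended in CD 1 with none voting aye with reservations; none voting no (0) and Representative(s) Say excused (1)."
)

# Hypothetical helper: return the integer in parentheses that follows `label`,
# or NA when the string does not contain that pattern
countAfter <- function(x, label) {
  pattern <- paste0(label, " \\(([0-9]+)\\)")
  hits <- regmatches(x, regexec(pattern, x))
  as.integer(vapply(
    hits,
    function(hit) if (length(hit) == 2) hit[2] else NA_character_,
    character(1)
  ))
}

voteCounts <- data.frame(
  no      = countAfter(votes, "voting no"),
  excused = countAfter(votes, "excused")
)
print(voteCounts)
```

To apply this to the full result, `unlist(appString)` could be passed in place of `votes`.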


2013-05-30 05:14:14 user1609452

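Thank you, user1609452. This is a good starting point for me to understand how to scrape aspx pages – user2300643

Thanks from me too! This is great :) –

Sorry, user1609452. Is it possible to list all the relevant URLs at once rather than only one at a time? Thanks again! – user2300643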
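One possible way to list every relevant URL at once is to paste the base address onto the vector of links. This is a sketch under assumptions: it supposes that `appUrls` holds site-relative paths beginning with "/" (as the `path =` argument in the answer suggests), and the two sample values here are illustrative only, with the second bill number made up for the example.

```r
basePage <- "http://capitol.hawaii.gov"

# Assumed sample values; in the answer, appUrls is built from the report page.
# The names mimic the "href" names that xpathSApply attaches to its result.
appUrls <- c(
  href = "/measure_indiv.aspx?billtype=HB&billnumber=17&year=2013",
  href = "/measure_indiv.aspx?billtype=HB&billnumber=62&year=2013"  # hypothetical
)

# paste0 is vectorised, so this builds all full URLs in one call
fullUrls <- paste0(basePage, unname(appUrls))
print(fullUrls)
```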
