2013-05-30 179 views
5

Scraping from an aspx website using R

I am trying to use R to scrape data from a website.

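  1. I would like to go through every link on the following page: http://capitol.hawaii.gov/advreports/advreport.aspx?year=2013&report=deadline&rpt_type=&measuretype=hb&title=House Bills

  2. Select only the items whose Current Status shows "Transmitted to Governor". For example, http://capitol.hawaii.gov/measure_indiv.aspx?billtype=HB&billnumber=17&year=2013

  3. Then scrape the cells in STATUS TEXT that contain the clause "Passed Final Reading". For example: Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0).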

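I have tried adapting earlier examples that use the packages RCurl and XML (in R), but I don't know how to use them correctly on aspx sites. So what I would like is: 1. Some suggestions on how to build such code. 2. Recommendations on how to learn what is needed to carry out this kind of task.

Thanks for any help,

Tom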

+0

I suggest you look through this thread, where I was trying to learn how to scrape a website. http://www.talkstats.com/showthread.php/26153-Still-trying-to-learn-to-scrape?highlight=still+learning+to+scrape

+0

I spent a couple of hours on this and it isn't easy :( You can get the content of the first page, but the second one doesn't accept the `__VIEWSTATE` and some other parameters I pass [as shown here](http://stackoverflow.com/questions/15853204/how-). I can get as far as `resp <- GET("http://capitol.hawaii.gov/advreports/advreport.aspx?year=2013&report=deadline&rpt_type=&measuretype=hb&title=House Bills"); writeBin(content(resp, 'raw'), tf); readHTMLTable(tf)$GridViewReports`, but the second site kills it :(

Answer

5
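library(httr)
library(XML)

basePage <- "http://capitol.hawaii.gov"

# a handle preserves settings and cookies across requests to the same host
h <- handle(basePage)

# visit the base page first so that any cookies it sets are stored on the handle
GET(handle = h)

res <- GET(handle = h, path = "/advreports/advreport.aspx?year=2013&report=deadline&rpt_type=&measuretype=hb&title=House")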

# parse content for "Transmitted to Governor" text
resXML <- htmlParse(content(res, as = "text"))
# column 3 of the GridViewReports table holds the status text
resTable <- getNodeSet(resXML, '//*/table[@id ="GridViewReports"]/tr/td[3]')
appRows <- sapply(resTable, xmlValue)
include <- grepl("Transmitted to Governor", appRows)
# column 2 holds the link to each measure's page
resUrls <- xpathSApply(resXML, '//*/table[@id ="GridViewReports"]/tr/td[2]//@href')

# keep only the links whose status matched
appUrls <- resUrls[include]

# look at just the first 

res <- GET(handle = h, path = appUrls[1]) 

resXML <- htmlParse(content(res, as = "text")) 


xpathSApply(resXML, '//*[text()[contains(.,"Passed Final Reading")]]', xmlValue) 

[1] "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, 
Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, 
Tokioka voting no (4) and none excused (0)." 
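
Once you have a sentence like this, you may want the numbers and names out of it. Here is a minimal base-R sketch, assuming the House-style wording shown above ("...; <names or none> voting no (n) and ... excused (n)"). The variable names are mine, and sentences worded differently would need their own patterns.

```r
# Example sentence copied from the output above
txt <- paste(
  "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan,",
  "Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro,",
  "Tokioka voting no (4) and none excused (0)."
)

# Counts given in parentheses after "voting no" and "excused"
noCount      <- as.integer(sub(".*voting no \\(([0-9]+)\\).*", "\\1", txt))
excusedCount <- as.integer(sub(".*excused \\(([0-9]+)\\).*", "\\1", txt))

# Text between the semicolon and "voting no": either "none" or a name list
noPart <- sub(".*; (.*) voting no.*", "\\1", txt)
noNames <- if (noPart == "none") {
  character(0)
} else {
  strsplit(sub("^Representative\\(s\\) ", "", noPart), ", ", fixed = TRUE)[[1]]
}

print(noCount)
print(excusedCount)
print(noNames)
```

If the pattern does not match, `sub` returns the input unchanged and `as.integer` gives `NA`, so check for `NA` before trusting the counts.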

Let the httr package take care of all the background work by setting up a handle.

If you want to run over all 92 links:
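
The filtering step in the code above can be tried offline, without touching the site. This is only a sketch: the HTML below is made up and merely mimics the layout of the GridViewReports table (link in column 2, status in column 3), and the HB18 row is invented for illustration.

```r
library(XML)

# Hypothetical HTML that only mimics the layout of the GridViewReports table
html <- '
<html><body>
<table id="GridViewReports">
  <tr><td>1</td>
      <td><a href="/measure_indiv.aspx?billtype=HB&amp;billnumber=17&amp;year=2013">HB17</a></td>
      <td>Transmitted to Governor.</td></tr>
  <tr><td>2</td>
      <td><a href="/measure_indiv.aspx?billtype=HB&amp;billnumber=18&amp;year=2013">HB18</a></td>
      <td>Referred to committee.</td></tr>
</table>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Status text from column 3 and link targets from column 2, row by row
status <- xpathSApply(doc, '//table[@id="GridViewReports"]//tr/td[3]', xmlValue)
links  <- xpathSApply(doc, '//table[@id="GridViewReports"]//tr/td[2]//@href')

# Keep only the links whose row has the wanted status
keep    <- grepl("Transmitted to Governor", status, fixed = TRUE)
matched <- unname(links[keep])
print(matched)
```

This relies on every row having exactly one link in column 2, so that `status` and `links` line up by position.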

# get all the links returned as a list (will take some time)
# print statement included for sanity
res <- lapply(appUrls, function(x) {
  print(sprintf("Got url no. %d", which(appUrls %in% x)))
  GET(handle = h, path = x)
})
resXML <- lapply(res, function(x) {
  htmlParse(content(x, as = "text"))
})
appString <- sapply(resXML, function(x) {
  xpathSApply(x, '//*[text()[contains(.,"Passed Final Reading")]]', xmlValue)
})


head(appString) 

> head(appString) 
$href 
[1] "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0)." 

$href 
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none."             
[2] "Passed Final Reading as amended in CD 1 with Representative(s) Cullen, Har voting aye with reservations; Representative(s) McDermott voting no (1) and none excused (0)." 

$href 
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none."         
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; Representative(s) Hashem, McDermott voting no (2) and none excused (0)." 

$href 
[1] "Passed Final Reading, as amended (CD 1). 24 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 1 Excused: Ige."      
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; none voting no (0) and Representative(s) Say excused (1)." 

$href 
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none."       
[2] "Passed Final Reading as amended in CD 1 with Representative(s) Johanson voting aye with reservations; none voting no (0) and none excused (0)." 

$href 
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none." 
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; none voting no (0) and none excused (0)." 
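
Every element of the result is named `href`, which makes it hard to tell which page a sentence came from. A minimal sketch of pairing each sentence with its full URL follows. The `appUrls` and `appString` values here are small stand-ins I made up so the snippet runs on its own (the second bill number and the placeholder sentences are hypothetical); with the real objects you would skip those two assignments.

```r
basePage <- "http://capitol.hawaii.gov"

# Stand-ins for the real objects built above (hypothetical values)
appUrls <- c(
  href = "/measure_indiv.aspx?billtype=HB&billnumber=17&year=2013",
  href = "/measure_indiv.aspx?billtype=HB&billnumber=999&year=2013"
)
appString <- list(
  href = "placeholder sentence A",
  href = c("placeholder sentence B", "placeholder sentence C")
)

# Turn the relative links into full URLs
fullUrls <- paste0(basePage, appUrls)

# One row per matched sentence, repeating the URL when a page has several
out <- data.frame(
  url  = rep(fullUrls, lengths(appString)),
  text = unlist(appString, use.names = FALSE),
  stringsAsFactors = FALSE
)
print(out)
```

`print(fullUrls)` on its own lists all the relevant URLs at once.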
+0

Thank you, user1609452. This is a nice starting point for me to understand how to scrape aspx pages. – user2300643

+0

Thanks from me too! This is awesome :)

+0

Sorry, user1609452. Is it possible to list all the relevant URLs rather than only one at a time? Thanks again! – user2300643