Scraping data using RCurl

I want to scrape some data from the following URL using RCurl and XML.

http://datacenter.mep.gov.cn/report/air_daily/air_dairy.jsp?&lang=

The data span "2000-06-05" to "2013-12-30", which amounts to more than 10,000 pages.

These are the elements in the page that relate to the data:

<form name="report1_turnPageForm" method=post  
action="http://datacenter.mep.gov.cn:80/.../air.../air_dairy.jsp..." style="display:none"> 
<input type=hidden name=reportParamsId value=122169> 
<input type=hidden name=report1_currPage value="1"> 
<input type=hidden name=report1_cachedId value=53661> 
</form>
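As a minimal sketch of how the hidden fields in that form could be read programmatically, the snippet below parses the markup quoted above (copied as-is, including the elided action URL) and collects the name/value pairs. The variable names `form_html` and `fields` are illustrative only.

```r
library(XML)

# Form markup copied from the question; the action URL is elided there
# and is deliberately left untouched here.
form_html <- '<form name="report1_turnPageForm" method=post
action="http://datacenter.mep.gov.cn:80/.../air.../air_dairy.jsp..." style="display:none">
<input type=hidden name=reportParamsId value=122169>
<input type=hidden name=report1_currPage value="1">
<input type=hidden name=report1_cachedId value=53661>
</form>'

# Parse the fragment and select every hidden <input> element
doc <- htmlParse(form_html, asText = TRUE)
inputs <- getNodeSet(doc, "//input[@type='hidden']")

# Build a named character vector: field name -> field value
fields <- sapply(inputs, xmlGetAttr, "value")
names(fields) <- sapply(inputs, xmlGetAttr, "name")
fields
```

In principle, a vector like this could be modified (for example, by changing `report1_currPage`) and passed to `postForm()` through its `.params` argument. Whether the server accepts such a request, and whether `reportParamsId` and `report1_cachedId` stay valid across sessions, is an untested assumption.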

The link also looks like this:

http://datacenter.mep.gov.cn/report/air_daily/air_dairy.jsp?city&startdate=2013-12-15&enddate=2013-12-30&page=31

It contains the start date, the end date, and the page number.

Then I started scraping the page.
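The structure of that link can be sketched as a small URL builder. `build_url` is a hypothetical helper, not part of RCurl or XML; the parameter names `startdate`, `enddate`, and `page` are taken from the example link, and it is an assumption that the server honours them for other values.

```r
# Hypothetical helper: fills the date range and page number into the
# query string seen in the example link.
build_url <- function(startdate, enddate, page) {
  sprintf(
    "http://datacenter.mep.gov.cn/report/air_daily/air_dairy.jsp?city&startdate=%s&enddate=%s&page=%d",
    format(as.Date(startdate)),  # normalise to YYYY-MM-DD
    format(as.Date(enddate)),
    as.integer(page)
  )
}

# sprintf() is vectorised, so several pages can be generated at once
urls <- build_url("2013-12-15", "2013-12-30", 1:3)
urls
```

Calling `build_url("2013-12-15", "2013-12-30", 31)` reproduces the example link above.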

require(RCurl) 
require(XML) 
k = postForm("http://datacenter.mep.gov.cn/report/air_daily/air_dairy.jsp?&lang=") 
k = iconv(k, 'gbk', 'utf-8') 
k = htmlParse(k, asText = TRUE, encoding = 'utf-8')

After that, I don't know what to do next. I'm not even sure whether I'm on the right track.

I also tried this:

k = sapply(getNodeSet(doc = k, path = "//font[@color='#0000FF' and @size='2']"), 
     xmlValue)[1:24]

It doesn't work.

Could you give me some suggestions? Thanks a lot!

Scrapy and BeautifulSoup solutions are also very welcome!

Source

2014-02-24 Bigchao

If the XML package alone is enough, maybe this could be a starting point:

require(XML)

url <- "http://datacenter.mep.gov.cn/report/air_daily/air_dairy.jsp?city&startdate=2013-12-15&enddate=2013-12-30&page=%d"
pages <- 2
tabs <- vector("list", length = pages)

for (page in 1:pages) {
    # Download the page line by line, then parse it as a single HTML document
    lines <- suppressWarnings(readLines(sprintf(url, page), encoding = "UTF-8"))
    doc <- htmlParse(paste(lines, collapse = "\n"))
    # which = 4 selects the fourth table; alternative: readHTMLTable(doc)[["report1"]]
    tabs[[page]] <- readHTMLTable(doc, header = TRUE, which = 4)
}

do.call(rbind.data.frame, tabs) # output

Source

2014-02-24 13:18:56 lukeA

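Thanks a lot, @lukeA! I think the XML package has some problem with the encoding of Chinese characters. All the returned data comes back as strange characters. Do you have any suggestions for dealing with this? – Bigchao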
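One possible cause, assuming the page is served as GBK (as the `iconv(k, 'gbk', 'utf-8')` call in the question suggests): `readLines(..., encoding = "UTF-8")` only marks the strings as UTF-8 and does not convert any bytes, so GBK bytes would show up garbled. The offline sketch below imitates that situation with a sample string and shows the bytes being converted explicitly. The sample text and variable names are illustrative, and the page's real encoding is not confirmed here.

```r
# Sample Chinese text written with escapes so the script itself stays ASCII
original <- "\u7a7a\u6c14\u8d28\u91cf"

# Simulate what a GBK-encoded server would send: the same text as GBK bytes
gbk_raw <- iconv(original, from = "UTF-8", to = "GBK", toRaw = TRUE)[[1]]
gbk_string <- rawToChar(gbk_raw)  # bytes with no declared encoding

# Convert the bytes to UTF-8 explicitly instead of merely labelling them
fixed <- iconv(gbk_string, from = "GBK", to = "UTF-8")

# For the real page, two options worth trying under the same assumption:
#   lines <- iconv(readLines(page_url), from = "GBK", to = "UTF-8")
#   doc   <- htmlParse(page_url, encoding = "GBK")
```

If the page turns out to use a different encoding, the `from` value would need to change accordingly.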
