
It looks like the site is blocking direct access from curl. I'm trying to download the live Olympic medal data into R.

library(XML) 
library(RCurl) 
theurl <- "http://www.london2012.com/medals/medal-count/" 
page <- getURL(theurl) 

page # fail 
[1] "<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don't have permission to access \"http&#58;&#47;&#47;www&#46;london2012&#46;com&#47;medals&#47;medal&#45;count&#47;\" on this server.<P>\nReference&#32;&#35;18&#46;358a503f&#46;1343590091&#46;c056ae2\n</BODY>\n</HTML>\n" 

Let's try and see whether we can read the table directly.

page <- readHTMLTable(theurl) 

No luck: Error in htmlParse(doc) : error in creating parser for http://www.london2012.com/medals/medal-count/

How would you go about getting this table into R?


Update: in response to the comments (and some prodding), spoofing the user-agent string does retrieve the content. But readHTMLTable then returns an error.

page <- getURLContent(theurl, useragent="Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20120716 Firefox/15.0a2") 

lynx seems to be blocked as well. – 2012-07-29 19:48:47


Since the page loads in Firefox, how about viewing the source there and saving it to disk? – 2012-07-29 19:58:52


With getURL you can specify a spoofed user-agent string, and that does retrieve the data. But readHTMLTable still doesn't play nicely; it returns an error ('Error in names(ans) = header : 'names' attribute [13] must be the same length as the vector [7]'). Not quite sure how to debug that. – 2012-07-29 20:03:12
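
One way to see where that length mismatch comes from is to parse the content fetched with the spoofed user agent and count the cells in each table row. A rough diagnostic sketch, assuming page holds the HTML from the getURLContent call above and that the medal table is the first <table> in the document:

library(XML)

# asText = TRUE tells htmlParse that `page` is the document itself,
# not a file name or URL
doc  <- htmlParse(page, asText = TRUE)
tab  <- getNodeSet(doc, "//table")[[1]]   # assumed: the medal table is the first table
rows <- getNodeSet(tab, ".//tr")

# cells per row: a header row with more cells than the body rows
# (13 vs 7 in the error quoted above) is what trips up readHTMLTable
sapply(rows, function(r) length(getNodeSet(r, "./th | ./td")))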

Answers


It looks like this works:

rr <- readHTMLTable(page,header=FALSE) 
rr2 <- setNames(rr[[1]], 
       c("rank","country","gold","silver","bronze","junk","total")) 
rr3 <- subset(rr2,select=-junk) 
## oops, numbers all got turned into factors ... 
tmpf <- function(x) { as.numeric(as.character(x)) } 
rr3[,-2] <- sapply(rr3[,-2],tmpf)    
head(rr3) 
## rank        country gold silver bronze total 
## 1 1    People's Republic of China 6  4  2 12 
## 2 2    United States of America 3  5  3 11 
## 3 3         Italy 2  3  2  7 
## 4 4      Republic of Korea 2  1  2  5 
## 5 5         France 2  1  1  4 
## 6 6 Democratic People's Republic of Korea 2  0  1  3 
with(rr3,dotchart(total,country)) 

I think you can use 'stringsAsFactors = FALSE' in the 'readHTMLTable' call. – GSee 2012-07-29 20:43:38


OK, but I think I'd still have to convert those columns to numeric? (see the sketch below) – 2012-07-29 20:44:32


Did you just go through the code to see whether it had a 'thead'? – 2012-07-29 20:50:28
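
Following up on the stringsAsFactors exchange above, a minimal sketch (assuming page still holds the content fetched with the spoofed user agent, and the same seven-column layout as in the answer):

library(XML)

# as suggested in the comment: don't treat the first row as a header, and
# keep text columns as character rather than factor
rr <- readHTMLTable(page, header = FALSE, stringsAsFactors = FALSE)[[1]]
rr <- setNames(rr, c("rank","country","gold","silver","bronze","junk","total"))
rr <- subset(rr, select = -junk)

# the medal columns are still character, so they need as.numeric(),
# but the as.character() detour is no longer necessary
rr[, -2] <- sapply(rr[, -2], as.numeric)
head(rr)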


Here is what I came up with using regular expressions. It is very ad hoc and certainly no better than using readHTMLTable as in the other answer; it is more to show how far you can get with text mining in R:

# file <- "~/Documents/R/medals.html" 
# page <- readChar(file,file.info(file)$size) 

library(RCurl) 
theurl <- "http://www.london2012.com/medals/medal-count/" 
page <- getURLContent(theurl, useragent="Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20120716 Firefox/15.0a2") 


# Remove html tags: 
page <- gsub("<(.|\n)*?>","",page) 
# Remove newlines and tabs:
page <- gsub("\\n|\\t","",page)

# match table: 
page <- regmatches(page,regexpr("(?<=Total).*(?=Detailed)",page,perl=TRUE)) 

# Extract country+medals+rank 
codes <-regmatches(page,gregexpr("\\d+[^\\r]*\\d+",page,perl=TRUE))[[1]] 
codes <- codes[seq(1,length(codes)-2,by=2)] 

# Extract country and medals: 
Names <- gsub("\\d","",codes) 
Medals <- sapply(regmatches(codes,gregexpr("\\d",codes)),function(x)x[(length(x)-2):length(x)]) 

# Create data frame: 
data.frame(
    Country = Names, 
    Gold = as.numeric(Medals[1,]), 
    Silver = as.numeric(Medals[2,]), 
    Bronze = as.numeric(Medals[3,])) 

And the output:

        Country Gold Silver Bronze 
1    People's Republic of China 6  4  2 
2    United States of America 3  5  3 
3         Italy 2  3  2 
4      Republic of Korea 2  1  2 
5         France 2  1  1 
6 Democratic People's Republic of Korea 2  0  1 
7        Kazakhstan 2  0  0 
8        Australia 1  1  1 
9         Brazil 1  1  1 
10        Hungary 1  1  1 
11       Netherlands 1  1  0 
12      Russian Federation 1  0  3 
13        Georgia 1  0  0 
14       South Africa 1  0  0 
15         Japan 0  2  3 
16       Great Britain 0  1  1 
17        Colombia 0  1  0 
18         Cuba 0  1  0 
19         Poland 0  1  0 
20        Romania 0  1  0 
21    Taipei (Chinese Taipei) 0  1  0 
22        Azerbaijan 0  0  1 
23        Belgium 0  0  1 
24         Canada 0  0  1 
25     Republic of Moldova 0  0  1 
26         Norway 0  0  1 
27         Serbia 0  0  1 
28        Slovakia 0  0  1 
29        Ukraine 0  0  1 
30        Uzbekistan 0  0  1 

+1 for the regex skills; even though I use them all the time, they still confuse me. – 2012-07-29 20:51:36


Sure, it's always a good skill to have: http://xkcd.com/208/ – 2012-07-30 09:06:34