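Here is what I came up with using regular expressions. It is very specific to this page, and definitely no better than using readHTMLTable as in the other answers. It is more to show how far you can get with text mining in R: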
# file <- "~/Documents/R/medals.html"
# page <- readChar(file,file.info(file)$size)
library(RCurl)
theurl <- "http://www.london2012.com/medals/medal-count/"
page <- getURLContent(theurl, useragent="Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20120716 Firefox/15.0a2")
# Remove html tags:
page <- gsub("<(.|\n)*?>","",page)
# Remove newlines (tabs and carriage returns are left in place):
page <- gsub("\\n","",page)
# match table:
page <- regmatches(page,regexpr("(?<=Total).*(?=Detailed)",page,perl=TRUE))
# Extract country+medals+rank
codes <- regmatches(page,gregexpr("\\d+[^\\r]*\\d+",page,perl=TRUE))[[1]]
# Keep every other match and drop the trailing ones:
codes <- codes[seq(1,length(codes)-2,by=2)]
# Extract country and medals:
Names <- gsub("\\d","",codes)
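# Note: this takes the last three single digits of each row, so it only works while every medal count is below 10
Medals <- sapply(regmatches(codes,gregexpr("\\d",codes)),function(x)x[(length(x)-2):length(x)])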
# Create data frame:
data.frame(
  Country = Names,
  Gold = as.numeric(Medals[1,]),
  Silver = as.numeric(Medals[2,]),
  Bronze = as.numeric(Medals[3,]))
And the output:
                                 Country Gold Silver Bronze
1             People's Republic of China    6      4      2
2               United States of America    3      5      3
3                                  Italy    2      3      2
4                      Republic of Korea    2      1      2
5                                 France    2      1      1
6  Democratic People's Republic of Korea    2      0      1
7                             Kazakhstan    2      0      0
8                              Australia    1      1      1
9                                 Brazil    1      1      1
10                               Hungary    1      1      1
11                           Netherlands    1      1      0
12                    Russian Federation    1      0      3
13                               Georgia    1      0      0
14                          South Africa    1      0      0
15                                 Japan    0      2      3
16                         Great Britain    0      1      1
17                              Colombia    0      1      0
18                                  Cuba    0      1      0
19                                Poland    0      1      0
20                               Romania    0      1      0
21               Taipei (Chinese Taipei)    0      1      0
22                            Azerbaijan    0      0      1
23                               Belgium    0      0      1
24                                Canada    0      0      1
25                   Republic of Moldova    0      0      1
26                                Norway    0      0      1
27                                Serbia    0      0      1
28                              Slovakia    0      0      1
29                               Ukraine    0      0      1
30                            Uzbekistan    0      0      1
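To make the individual steps easier to follow without hitting the live site, here is a minimal sketch of the same idea on a toy string. The HTML, the `toy` variable, and the row layout are invented for illustration and are not the real page. It also uses `\\d+` with `tail()` instead of single digits, which is one possible way to cope with medal counts of 10 or more, assuming the numbers stay separated by whitespace once the tags are gone.

```r
# Toy HTML invented for illustration; "\r" separates the rows here
toy <- paste0(
  "<h2>Total</h2>\r",
  "<td>1</td> <td>Italy</td> <td>12</td> <td>3</td> <td>2</td>\r",
  "<td>2</td> <td>France</td> <td>2</td> <td>10</td> <td>1</td>\r",
  "<a>Detailed</a>"
)

# Step 1: strip tags with a lazy match
txt <- gsub("<(.|\n)*?>", "", toy, perl = TRUE)

# Step 2: keep only the text between the two marker words;
# (?s) lets "." match any character, including line breaks
body <- regmatches(txt, regexpr("(?s)(?<=Total).*(?=Detailed)", txt, perl = TRUE))

# Step 3: split into rows and drop empty ones
rows <- strsplit(body, "\r", fixed = TRUE)[[1]]
rows <- rows[nzchar(rows)]

# Step 4: extract whole numbers and keep the last three per row
nums <- regmatches(rows, gregexpr("\\d+", rows))
medals <- sapply(nums, function(x) as.numeric(tail(x, 3)))

# Step 5: country name is what remains after removing digits and padding
result <- data.frame(
  Country = trimws(gsub("\\d", "", rows)),
  Gold = medals[1, ],
  Silver = medals[2, ],
  Bronze = medals[3, ]
)
print(result)
```

Splitting on the row separator first keeps each row's numbers together, so the every-other-match indexing is not needed in this toy version. Whether that carries over to the real page depends on its markup.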
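Lynx seems to be blocked as well. – 2012-07-29 19:48:47
What about loading the page in Firefox, viewing the source, and saving it to disk? – 2012-07-29 19:58:52
With getURL you can specify a fake user agent string, and that works for fetching the data. But readHTMLTable still does not play nicely with it. It returns an error (`Error in names(ans) = header : 'names' attribute [13] must be the same length as the vector [7]`) that I am not quite sure how to debug. – 2012-07-29 20:03:12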
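On the suggestion of saving the page source to disk: the two commented-out lines at the top of the answer do exactly that kind of offline read. A minimal sketch, assuming a saved copy exists; the temporary file and its contents below are hypothetical stand-ins for the real saved page:

```r
# Hypothetical stand-in for a page saved from the browser
file <- tempfile(fileext = ".html")
writeLines("<html><body><b>Total</b> 1 Italy 2 3 2 <i>Detailed</i></body></html>", file)

# Read the whole file into a single string, as in the commented-out lines
page <- readChar(file, file.info(file)$size)

# From here the same steps apply, for example stripping the tags
text_only <- gsub("<(.|\n)*?>", "", page, perl = TRUE)
```

This sidesteps the user agent problem entirely, at the cost of the data being only as fresh as the saved copy.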