2014-03-26
3

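Scraping Wikipedia with R to build a list and a data frame

I want to scrape the Wikipedia entry for the Vancouver Olympic Games. Unfortunately, the data is not in a clean table format.

I want to create a data frame with 2 columns: Nation and number of athletes.

So far I have: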

library(XML) 
library(RCurl) 

path<-"https://fr.wikipedia.org/wiki/Jeux_olympiques_d%27hiver_de_2010" 
webpage <- getURL(path) 
webpage <- readLines(tc <- textConnection(webpage)); close(tc) 

pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE) 

# Extract table header and contents 
tablehead <- xpathSApply(pagetree, "//*/table/tr", xmlValue) 
country<-tablehead[31] 

where `country` is:

> country 
[1] "\n Afrique du Sud (2)\n Albanie (1)\n Algérie (1)\n Allemagne (153)\n Andorre (6)\n Argentine (7)\n Arménie (4)\n Australie (41)\n Autriche (82)\n Azerbaïdjan (2)\n Belgique (8)\n Bermudes (1)\n Biélorussie (50)\n Bosnie-Herzégovine (5)\n Brésil (5)\n Bulgarie (18)\n Canada (206)\n Chili (3)\n Chine (90)\n Chypre (2)\n Colombie (1)\n\n\n\n Corée du Nord (2)\n Corée du Sud (46)\n Croatie (18)\n Danemark (18)\n Espagne (18)\n Estonie (32)\n États-Unis (216)\n Éthiopie (1)\n Finlande (95)\n France (108)\n Géorgie (12)\n Ghana (1)\n Grande-Bretagne (52)\n Grèce (7)\n Hong Kong (1)\n Hongrie (16)\n Îles Caïmans (1)\n Inde (3)\n Iran (4)\n Irlande (6)\n Islande (4)\n\n\n\n Israël (3)\n Italie (109)\n Jamaïque (1)\n Japon (94)\n Kazakhstan (38)\n Kirghizistan (2)\n Lettonie (54)\n Liban (3)\n Liechtenstein (6)\n Lituanie (6)\n Macédoine (3)\n Moldavie (8)\n Maroc (1)\n Mexique (1)\n Monaco (3)\n Monténégro (1)\n Mongolie (2)\n Népal (1)\n Norvège (99)\n Nouvelle-Zélande (16)\n\n\n\n Ouzbékistan (3)\n Pakistan (1)\n Pays-Bas (34)\n Pérou (3)\n Pologne (50)\n Portugal (1)\n République tchèque (93)\n Roumanie (29)\n Russie (179)\n Saint-Marin (1)\n Sénégal (1)\n Serbie (10)\n Slovaquie (73)\n Slovénie (49)\n Suède (108)\n Suisse (146)\n Tadjikistan (1)\n Taipei chinois (1)\n Turquie (5)\n Ukraine (47)\n\n" 

I have already tried:

library(stringr) 
str_detect(country, "\n") 
country <- str_split(country, "\n") 

but the data is very dirty and this does not work well. Any suggestions?

You need to explain what does not work and what problem you are trying to solve.

Answers

1

One possibility is to use regular expressions. I have never done this in R, but stringr seems to be the recommended library:
Extract a regular expression match in R version 2.10
http://cran.r-project.org/web/packages/stringr/stringr.pdf
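As a rough illustration of the regular-expression idea, here is a minimal sketch that uses `str_match` with two capture groups. The short `country` string is a hand-made sample in the same shape as the one in the question, and the names `lines`, `m`, `keep`, and `athletes` are only illustrative.

```r
library(stringr)

# Hand-made sample in the same shape as the `country` string from the question
country <- "\n Afrique du Sud (2)\n Albanie (1)\n Algérie (1)\n\n\n\n Corée du Nord (2)\n"

# One element per line; blank lines become empty strings
lines <- str_split(country, "\n")[[1]]

# Group 1: the name before the parenthesis; group 2: the digits inside it
m <- str_match(lines, "^\\s*(.*?)\\s*\\((\\d+)\\)\\s*$")

# Lines that do not match (the empty ones) come back as NA rows; drop them
keep <- !is.na(m[, 1])
athletes <- data.frame(
  nation = m[keep, 2],
  n_athletes = as.integer(m[keep, 3]),
  stringsAsFactors = FALSE
)
print(athletes)
```

Because `str_match` is vectorised over its input, no explicit loop or `sapply` call is needed, and the empty lines are filtered out in one step.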

Edit: here is the code that works for me:

library(XML) 
library(RCurl) 
library(stringr) 
# Point RCurl at its bundled CA certificates so the HTTPS request can be verified 
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))) 

path <- "https://fr.wikipedia.org/wiki/Jeux_olympiques_d%27hiver_de_2010" 
webpage <- getURL(path) 
webpage <- readLines(tc <- textConnection(webpage)); close(tc) 

pagetree <- htmlTreeParse(webpage, error = function(...){}, useInternalNodes = TRUE, encoding = "UTF-8") 
# Extract the text of every table row; row 31 is the one shown in the question 
tablehead <- xpathSApply(pagetree, "//*/table/tr", xmlValue) 
country <- tablehead[31] 

# Split the single string into one element per line 
country <- strsplit(country, "\n") 

# extract country: everything before the first "(" 
bar <- function(x) str_trim(str_extract(x, "[^(]*"), side = "both") 
res1 <- sapply(country[[1]], bar) 
# extract number of athletes: the text between the parentheses 
foo <- function(x) str_trim(str_match(x, "\\((.*?)\\)")[[2]], side = "both") 
res2 <- sapply(country[[1]], foo) 
# build df, dropping the rows that came from empty lines 
res2 <- as.numeric(res2) 
df <- data.frame(res1, res2) 
df <- df[res1 != "", ] 
# inspect df 
nrow(df) 
summary(df) 

Hi etna, this looks like just the thing! But at this point... I get a data frame with 1 observation of two variables. – delaye

Strange. I have reproduced the whole code, since it seems to work here, and added a few lines to check the result. – etna

I don't understand what went wrong yesterday! Thanks etna, it works! – delaye

0

A working attempt:

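library(plyr) 
library(stringr) 
country <- str_split(country, "\n")[[1]] 
# Iterate over every line. Passing country[[1]] here would process only the 
# first (empty) element and leave zero rows after na.omit(). 
# Note: "[A-Za-z]+" stops at the first space or accented letter. 
df <- ldply(country, function(z) data.frame(a = str_extract(z, "[A-Za-z]+"), b = str_extract(z, "[0-9]+"))) 
head(na.omit(df)) 

          a   b 
2   Afrique   2 
3   Albanie   1 
4       Alg   1 
5 Allemagne 153 
6   Andorre   6 
7 Argentine   7 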
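The `Alg` and `Afrique` entries in the output above come from the `[A-Za-z]+` pattern, which stops at the first character that is not an unaccented ASCII letter. One possible variant, sketched here on a few hand-picked lines, is to take everything before the opening parenthesis instead. The vector `x` and the names `ascii_only`, `full_name`, and `comparison` are illustrative only.

```r
library(stringr)

# A few lines in the same shape as the split `country` vector
x <- c(" Afrique du Sud (2)", " Algérie (1)", " Allemagne (153)")

# ASCII-only class: the match ends at the first space or accented letter
ascii_only <- str_extract(x, "[A-Za-z]+")

# Everything before the opening parenthesis, with surrounding whitespace trimmed
full_name <- str_trim(str_extract(x, "[^(]+"))

# Side-by-side comparison, with the athlete count as an integer
comparison <- data.frame(
  ascii_only,
  full_name,
  n = as.integer(str_extract(x, "[0-9]+")),
  stringsAsFactors = FALSE
)
print(comparison)
```

The negated class `[^(]+` makes no assumption about which letters a nation's name contains, so multi-word and accented names are kept whole.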

Thanks Scott, but it does not work on my machine. `head(na.omit(df))` outputs `[1] str_extract.z....A.Za.z......1.. str_extract.z....0.9.... <0 lignes> (ou 'row.names' de longueur nulle)` – delaye

What do you mean? What does the object df look like? – sckott

`str_extract.z....A.Za.z......1.. str_extract.z....0.9....` – delaye