Scraping Wikipedia with R to build a list and a data frame

I want to scrape the Wikipedia article on the Vancouver Olympic Games. Unfortunately, the data I need is not in a clean table format.

I want to create a data frame with 2 columns: Nation and number of athletes.

At this point I have

library(XML) 
library(RCurl) 

path<-"https://fr.wikipedia.org/wiki/Jeux_olympiques_d%27hiver_de_2010" 
webpage <- getURL(path) 
webpage <- readLines(tc <- textConnection(webpage)); close(tc) 

pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE) 

# Extract table header and contents 
tablehead <- xpathSApply(pagetree, "//*/table/tr", xmlValue) 
country<-tablehead[31]

where country is

> country 
[1] "\n Afrique du Sud (2)\n Albanie (1)\n Algérie (1)\n Allemagne (153)\n Andorre (6)\n Argentine (7)\n Arménie (4)\n Australie (41)\n Autriche (82)\n Azerbaïdjan (2)\n Belgique (8)\n Bermudes (1)\n Biélorussie (50)\n Bosnie-Herzégovine (5)\n Brésil (5)\n Bulgarie (18)\n Canada (206)\n Chili (3)\n Chine (90)\n Chypre (2)\n Colombie (1)\n\n\n\n Corée du Nord (2)\n Corée du Sud (46)\n Croatie (18)\n Danemark (18)\n Espagne (18)\n Estonie (32)\n États-Unis (216)\n Éthiopie (1)\n Finlande (95)\n France (108)\n Géorgie (12)\n Ghana (1)\n Grande-Bretagne (52)\n Grèce (7)\n Hong Kong (1)\n Hongrie (16)\n Îles Caïmans (1)\n Inde (3)\n Iran (4)\n Irlande (6)\n Islande (4)\n\n\n\n Israël (3)\n Italie (109)\n Jamaïque (1)\n Japon (94)\n Kazakhstan (38)\n Kirghizistan (2)\n Lettonie (54)\n Liban (3)\n Liechtenstein (6)\n Lituanie (6)\n Macédoine (3)\n Moldavie (8)\n Maroc (1)\n Mexique (1)\n Monaco (3)\n Monténégro (1)\n Mongolie (2)\n Népal (1)\n Norvège (99)\n Nouvelle-Zélande (16)\n\n\n\n Ouzbékistan (3)\n Pakistan (1)\n Pays-Bas (34)\n Pérou (3)\n Pologne (50)\n Portugal (1)\n République tchèque (93)\n Roumanie (29)\n Russie (179)\n Saint-Marin (1)\n Sénégal (1)\n Serbie (10)\n Slovaquie (73)\n Slovénie (49)\n Suède (108)\n Suisse (146)\n Tadjikistan (1)\n Taipei chinois (1)\n Turquie (5)\n Ukraine (47)\n\n"

I have tried

library(stringr) 
str_detect(country,"\n") 
country<-str_split(country,"\n")

but the data is very dirty and this does not work well. Any suggestions?

Source

2014-03-26 delaye

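You need to explain what isn't working and what problem you are trying to solve. –

One possibility is to use regular expressions. I have never done this in R, but stringr seems to be the recommended library: Extract a regular expression match in R version 2.10 (http://cran.r-project.org/web/packages/stringr/stringr.pdf)

Edit: here is code that works for me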

library(XML) 
library(RCurl) 
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))) 
library(stringr) 

path<-"https://fr.wikipedia.org/wiki/Jeux_olympiques_d%27hiver_de_2010" 
webpage <- getURL(path) 
webpage <- readLines(tc <- textConnection(webpage)); close(tc) 

pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE, encoding = "UTF-8") 
# Extract table header and contents 
tablehead <- xpathSApply(pagetree, "//*/table/tr", xmlValue) 
country<-tablehead[31] 

country<-strsplit(country,"\n") 

# extract country 
bar <- function(x) str_trim(str_extract(x, "[^(]*"), side = "both") 
res1 <- sapply(country[[1]], bar)  
# extract nb of athletes 
foo <- function(x) str_trim(str_match(x, "\\((.*?)\\)")[[2]], side = "both") 
res2 <- sapply(country[[1]], foo) 
# build df 
res2 <- as.numeric(res2) 
df <- data.frame(res1, res2) 
df <- df[res1 != "",] 
# inspect df 
nrow(df) 
summary(df)
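
The per-line `sapply` calls above can also be written as one vectorised pass. The following is only a sketch under stated assumptions: `raw` is a shortened, hand-copied sample of the scraped string (the real one comes from `tablehead[31]`), and the names `raw`, `lines`, `m` and `athletes_df` are made up for illustration.

```r
library(stringr)

# Assumption: a shortened sample of the scraped text, same shape as tablehead[31].
raw <- "\n Afrique du Sud (2)\n Albanie (1)\n Algérie (1)\n\n\n\n Corée du Nord (2)\n États-Unis (216)\n"

# One element per line, surrounding whitespace removed, blank lines dropped.
lines <- str_trim(str_split(raw, "\n")[[1]])
lines <- lines[lines != ""]

# str_match returns a matrix: column 1 is the full match,
# column 2 the name, column 3 the digits inside the parentheses.
m <- str_match(lines, "^(.*?)\\s*\\((\\d+)\\)$")

athletes_df <- data.frame(
  Nation = m[, 2],
  Athletes = as.integer(m[, 3]),
  stringsAsFactors = FALSE
)
print(athletes_df)
```

Dropping the blank lines before matching means no empty rows need to be filtered out of the data frame afterwards.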

Source

2014-03-26 14:02:27 etna

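Hi etna, this looks like what I need! But at this point... I get a data frame with 1 observation of two variables. – delaye

Strange. I have reproduced the whole code, since it seems to work here, and added a few lines to check the result. – etna

I don't understand what went wrong yesterday! Thanks etna, it works! – delaye

A working attempt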

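library(plyr) 
library(stringr) 
country <- str_split(country,"\n")[[1]] 
# Iterate over every line (country, not country[[1]]) and name the columns a and b. 
# Note: "[A-Za-z]+" is ASCII-only, so a name is cut at the first space or accented letter. 
df <- ldply(country, function(z) data.frame(a = str_extract(z, "[A-Za-z]+"), b = str_extract(z, "[0-9]+"))) 
head(na.omit(df)) 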

          a   b
2   Afrique   2
3   Albanie   1
4       Alg   1
5 Allemagne 153
6   Andorre   6
7 Argentine   7
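
The truncated names in this output ("Alg" for Algérie, "Afrique" for Afrique du Sud) come from the ASCII-only character class. As a hedged sketch, a Unicode letter class can keep the full name. The sample `lines` vector and the variable names below are assumptions for illustration, not part of the answer.

```r
library(stringr)

# Assumption: a few sample lines in the same "Name (count)" shape as the scraped text.
lines <- c(" Afrique du Sud (2)", " Algérie (1)",
           " Bosnie-Herzégovine (5)", " États-Unis (216)")

# ASCII-only pattern: stops at the first space or accented letter.
ascii_only <- str_extract(lines, "[A-Za-z]+")

# \p{L} matches any Unicode letter; apostrophes, spaces and hyphens are
# also allowed so that multi-word and hyphenated names stay in one piece.
full_name <- str_trim(str_extract(lines, "[\\p{L}' \\-]+"))

# The count is the first run of digits on the line.
athletes <- as.integer(str_extract(lines, "[0-9]+"))

print(data.frame(ascii_only, full_name, athletes, stringsAsFactors = FALSE))
```

The match stops at the opening parenthesis because "(" is not in the character class, so the count never leaks into the name.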

Source

2014-03-26 14:17:02 sckott

Thanks Scott, but it doesn't work on my machine. 'head(na.omit(df))' returns '[1] str_extract.z .... A.Za.z ...... 1 .. str_extract.z .... 0.9 .... <0 lignes> (ou 'row.names' de longueur nulle)' – delaye

What do you mean? What does the object df look like? – sckott

'str_extract.z .... A.Za.z ...... 1 .. str_extract.z .... 0.9 .... ' – delaye
