2016-03-19

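XML: accessing nested items with the same name

I want to access the revision details in the XML export of a Wikipedia article. In other words, I want a data.frame with one row per revision (which, as I understand the tree structure, should be //page/revision) and one column for each element of the revision sublist. Importantly, different revision sublists may contain different elements.

Data: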

require(XML) 
require(httr) 
r <- POST("http://en.wikipedia.org/w/index.php?title=Special:Export", 
      body = "pages=Euroswydd&offset=1&limit=2&action=submit") 
stop_for_status(r) 
xml <- content(r, "text") 
xml_data <- xmlToList(xml) 
str(xml_data) 

which outputs:

List of 3 
$ siteinfo:List of 6 
..$ sitename : chr "Wikipedia" 
..$ dbname : chr "enwiki" 
..$ base  : chr "https://en.wikipedia.org/wiki/Main_Page" 
..$ generator : chr "MediaWiki 1.27.0-wmf.17" 
..$ case  : chr "first-letter" 
..$ namespaces:List of 35 
... [not of interest] ... 
$ page :List of 5 
..$ title : chr "Euroswydd" 
..$ ns  : chr "0" 
..$ id  : chr "86146" 
..$ revision:List of 7 
.. ..$ id   : chr "4028683" 
.. ..$ timestamp : chr "2002-09-16T03:24:52Z" 
.. ..$ contributor:List of 2 
.. .. ..$ username: chr "TUF-KAT" 
.. .. ..$ id  : chr "8351" 
.. ..$ model  : chr "wikitext" 
.. ..$ format  : chr "text/x-wiki" 
.. ..$ text  :List of 2 
.. .. ..$ text : chr "In [[Celtic mythology]], '''Eurossydd''' held [[Llyr]] hostage until his wife, [[Penarddun]] slept with him. Their twin childr"| __truncated__ 
.. .. ..$ .attrs:Formal class 'XMLAttributes' [package "XML"] with 1 slot 
.. .. .. .. ..@ .Data: chr [1:2] "preserve" "163" 
.. ..$ sha1  : chr "ivzrvt6jgoga4ndtrdmz5ldg5elfoma" 
..$ revision:List of 9 
.. ..$ id   : chr "9228569" 
.. ..$ parentid : chr "4028683" 
.. ..$ timestamp : chr "2004-06-11T02:22:33Z" 
.. ..$ contributor:List of 2 
.. .. ..$ username: chr "Gtrmp" 
.. .. ..$ id  : chr "38984" 
.. ..$ minor  : NULL 
.. ..$ model  : chr "wikitext" 
.. ..$ format  : chr "text/x-wiki" 
.. ..$ text  :List of 2 
.. .. ..$ text : chr "In [[Celtic mythology]], '''Eurossydd''' held [[Llyr]] hostage until his wife, [[Penarddun]] slept with him. Their twin childr"| __truncated__ 
.. .. ..$ .attrs:Formal class 'XMLAttributes' [package "XML"] with 1 slot 
.. .. .. .. ..@ .Data: chr [1:2] "preserve" "203" 
.. ..$ sha1  : chr "kwd09htf87bjc51y2z9ykpnasu7nqle" 
$ .attrs :Formal class 'XMLAttributes' [package "XML"] with 1 slot 
.. ..@ .Data: chr [1:3] "http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" "0.10" "en" 

Now, I can use xml_data[['page']][['revision']] to get the first revision list. But how can I access the second revision?
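
For reference, a list built by xmlToList keeps repeated child elements as separate entries that share one name, and `[[` with a name returns only the first match. Below is a minimal sketch of selecting every `revision` entry by name and then by position. It uses a small hand-made list, where `page` is a hypothetical stand-in for `xml_data[['page']]` and not the real export:

```r
# Hypothetical stand-in for xml_data[["page"]]: two entries named "revision".
page <- list(
  title    = "Euroswydd",
  revision = list(id = "4028683", timestamp = "2002-09-16T03:24:52Z"),
  revision = list(id = "9228569", parentid = "4028683",
                  timestamp = "2004-06-11T02:22:33Z")
)

# `[[` with a name returns only the first matching entry.
first <- page[["revision"]]

# Logical indexing on the names keeps every entry called "revision".
revisions <- page[names(page) == "revision"]

# Individual revisions can then be picked by position.
second <- revisions[[2]]
second$id
```

This only addresses the lookup itself; turning the entries into one row each still requires handling the nested and missing elements.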

For XML processing, XPath is a good approach. Use 'xml_data[['page']][['revision']]' to access the first revision list; with an Iterator and '->next()' you will get the second element. Take a look at this code: http://stackoverflow.com/a/14448325/390462 – ThierryB

Take a look at the 'WikipediR' package; it has 'revision_content' and 'revision_diff' functions. – alistaire

Answers

1

Using rvest you can do something like the following:

Helper function:

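library(xml2)   # xml_children(), xml_length(), xml_name(), xml_text()
library(rvest)  # re-exports the %>% pipe

# Flatten one node into a named list; children of nested elements get the
# parent's name as a prefix (e.g. contributor_username).
parse_nested <- function(x, prefix = "") {
    kids <- x %>% xml_children()
    # Indices of children that themselves contain child elements
    ind <- which(sapply(kids, xml_length) != 0)
    if (!length(ind)) {
        # Leaf level: return the text of each child, named by its tag
        return(setNames(kids %>% xml_text(),
                        paste0(prefix, kids %>% xml_name())))
    }
    nested <- parse_nested(kids[ind],
                           prefix = paste0(prefix, kids[ind] %>% xml_name(), "_"))
    unnested <- setNames(kids[-ind] %>% xml_text(),
                         paste0(prefix, kids[-ind] %>% xml_name()))
    as.list(c(unnested, nested))
}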

Actual code:

require(httr) 
r <- POST("http://en.wikipedia.org/w/index.php?title=Special:Export", 
      body = "pages=Euroswydd&offset=1&limit=2&action=submit") 

require(rvest) 
doc <- read_html(r) 
doc %>% 
    html_nodes("revision") %>% 
    lapply(parse_nested) %>% # parse each revision separately
    data.table::rbindlist(fill=TRUE) # combine them into one table

Result (data.table):

        id            timestamp    model      format ---
1: 4028683 2002-09-16T03:24:52Z wikitext text/x-wiki ---
2: 9228569 2004-06-11T02:22:33Z wikitext text/x-wiki ---

Thanks to @Arun for pointing out that data.table::rbindlist accepts a list.

plyr::rbind.fill can be used as an alternative to data.table::rbindlist.
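
To illustrate why `fill=TRUE` matters here: revisions can carry different fields (in the question's output the second revision has `parentid`, the first does not). Below is a minimal sketch with two hand-made lists, assuming the data.table package is installed. The values are copied from the question's output, and the names `rev1`, `rev2` and `revs` are purely illustrative:

```r
library(data.table)

# Two revisions with different sets of fields (illustrative data only).
rev1 <- list(id = "4028683", timestamp = "2002-09-16T03:24:52Z")
rev2 <- list(id = "9228569", parentid = "4028683",
             timestamp = "2004-06-11T02:22:33Z")

# fill = TRUE matches columns by name and pads missing fields with NA.
revs <- rbindlist(list(rev1, rev2), fill = TRUE)
print(revs)
```

Without `fill=TRUE`, rbindlist would be expected to complain about items with differing numbers of columns, which is why the flag is used in the pipeline above.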

I can't get it to work. What did you pass as 'doc'? An 'xml2' node object? – CptNemo
