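XML: accessing nested items with the same name
I want to access the revision details in the XML export of a Wikipedia article. In other words, I want a data.frame with one row per revision (as I understand the tree structure, the path should be //page/revision) and one column for each element of the revision sublist (the elements of interest may differ from one revision sublist to the next).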
Data:
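library(XML)
library(httr)
# Request the first two revisions of the article from Special:Export
r <- POST("http://en.wikipedia.org/w/index.php?title=Special:Export",
body = "pages=Euroswydd&offset=1&limit=2&action=submit")
stop_for_status(r)
# Read the response body as text and convert the XML into a nested list
xml <- content(r, "text")
xml_data <- xmlToList(xml)
str(xml_data)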
which outputs:
List of 3
$ siteinfo:List of 6
..$ sitename : chr "Wikipedia"
..$ dbname : chr "enwiki"
..$ base : chr "https://en.wikipedia.org/wiki/Main_Page"
..$ generator : chr "MediaWiki 1.27.0-wmf.17"
..$ case : chr "first-letter"
..$ namespaces:List of 35
... [not of interest] ...
$ page :List of 5
..$ title : chr "Euroswydd"
..$ ns : chr "0"
..$ id : chr "86146"
..$ revision:List of 7
.. ..$ id : chr "4028683"
.. ..$ timestamp : chr "2002-09-16T03:24:52Z"
.. ..$ contributor:List of 2
.. .. ..$ username: chr "TUF-KAT"
.. .. ..$ id : chr "8351"
.. ..$ model : chr "wikitext"
.. ..$ format : chr "text/x-wiki"
.. ..$ text :List of 2
.. .. ..$ text : chr "In [[Celtic mythology]], '''Eurossydd''' held [[Llyr]] hostage until his wife, [[Penarddun]] slept with him. Their twin childr"| __truncated__
.. .. ..$ .attrs:Formal class 'XMLAttributes' [package "XML"] with 1 slot
.. .. .. .. ..@ .Data: chr [1:2] "preserve" "163"
.. ..$ sha1 : chr "ivzrvt6jgoga4ndtrdmz5ldg5elfoma"
..$ revision:List of 9
.. ..$ id : chr "9228569"
.. ..$ parentid : chr "4028683"
.. ..$ timestamp : chr "2004-06-11T02:22:33Z"
.. ..$ contributor:List of 2
.. .. ..$ username: chr "Gtrmp"
.. .. ..$ id : chr "38984"
.. ..$ minor : NULL
.. ..$ model : chr "wikitext"
.. ..$ format : chr "text/x-wiki"
.. ..$ text :List of 2
.. .. ..$ text : chr "In [[Celtic mythology]], '''Eurossydd''' held [[Llyr]] hostage until his wife, [[Penarddun]] slept with him. Their twin childr"| __truncated__
.. .. ..$ .attrs:Formal class 'XMLAttributes' [package "XML"] with 1 slot
.. .. .. .. ..@ .Data: chr [1:2] "preserve" "203"
.. ..$ sha1 : chr "kwd09htf87bjc51y2z9ykpnasu7nqle"
$ .attrs :Formal class 'XMLAttributes' [package "XML"] with 1 slot
.. ..@ .Data: chr [1:3] "http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" "0.10" "en"
Now I can reach the first revision list with xml_data[['page']][['revision']]. But how can I access the second revision?
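One possible way to reach the second revision is to select list elements by comparing names rather than by using [[ ]], which stops at the first match. The sketch below is only an illustration: `page` is a small hand-built stand-in for xml_data[['page']] (so it runs without a network call), and `get_field` is a hypothetical helper, not part of any package.

```r
# A tiny stand-in for xml_data[['page']]: two elements share the name "revision".
page <- list(
  title = "Euroswydd",
  revision = list(id = "4028683", timestamp = "2002-09-16T03:24:52Z",
                  contributor = list(username = "TUF-KAT", id = "8351")),
  revision = list(id = "9228569", parentid = "4028683",
                  timestamp = "2004-06-11T02:22:33Z",
                  contributor = list(username = "Gtrmp", id = "38984"))
)

# page[["revision"]] returns only the first match, so filter on the names instead.
revisions <- page[names(page) == "revision"]
second <- revisions[[2]]

# Hypothetical helper: fetch one field from a list, or NA if it is absent or NULL.
get_field <- function(x, name) {
  value <- x[[name]]
  if (is.null(value)) NA_character_ else value
}

# Build one row per revision; fields missing from a revision become NA.
fields <- c("id", "parentid", "timestamp")
rows <- lapply(revisions, function(rev) {
  vals <- vapply(fields, function(f) get_field(rev, f), character(1))
  c(vals, username = get_field(rev$contributor, "username"))
})
rev_df <- as.data.frame(do.call(rbind, unname(rows)), stringsAsFactors = FALSE)
```

With the real data, the same filter would be applied to xml_data[['page']]. Only scalar fields are collected here; nested elements such as text would need their own handling.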
For XML processing, XPath is a good approach. Use 'xml_data[['page']][['revision']]' to access the first revision list; with an Iterator and '->next()' you will get the second element. Take a look at this code: http://stackoverflow.com/a/14448325/390462 – ThierryB
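In R, the XPath idea from this comment might look roughly like the following with the XML package used in the question. This is a sketch under assumptions: the inline `xml_text` is a trimmed, hand-written imitation of the export, the `mw` prefix is an arbitrary name bound to the export namespace shown in the str output, and `first_value` is a hypothetical helper.

```r
library(XML)

# Trimmed imitation of the Special:Export output (assumed structure).
xml_text <- '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Euroswydd</title>
    <revision>
      <id>4028683</id>
      <timestamp>2002-09-16T03:24:52Z</timestamp>
      <contributor><username>TUF-KAT</username><id>8351</id></contributor>
    </revision>
    <revision>
      <id>9228569</id>
      <parentid>4028683</parentid>
      <timestamp>2004-06-11T02:22:33Z</timestamp>
      <contributor><username>Gtrmp</username><id>38984</id></contributor>
    </revision>
  </page>
</mediawiki>'

doc <- xmlParse(xml_text, asText = TRUE)

# The export declares a default namespace, so XPath needs an explicit prefix.
ns <- c(mw = "http://www.mediawiki.org/xml/export-0.10/")
rev_nodes <- getNodeSet(doc, "//mw:page/mw:revision", namespaces = ns)

# Hypothetical helper: first matching value under a node, or NA if none.
first_value <- function(node, path) {
  hit <- xpathSApply(node, path, xmlValue, namespaces = ns)
  if (length(hit) == 0) NA_character_ else hit[[1]]
}

# One row per revision node, one column per XPath of interest.
rev_df <- data.frame(
  id        = vapply(rev_nodes, first_value, character(1), path = "./mw:id"),
  parentid  = vapply(rev_nodes, first_value, character(1), path = "./mw:parentid"),
  timestamp = vapply(rev_nodes, first_value, character(1), path = "./mw:timestamp"),
  username  = vapply(rev_nodes, first_value, character(1),
                     path = "./mw:contributor/mw:username"),
  stringsAsFactors = FALSE
)
```

For the live response, xmlParse(xml, asText = TRUE) on the text from the question would replace the inline string.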
Take a look at the 'WikipediR' package; it has 'revision_content' and 'revision_diff' functions. – alistaire