2015-10-01 62 views
1

From XML attributes to data.frame in R

I have an XML file that contains data like this:

<?xml version="1.0" encoding="utf-8"?> 
<posts> 
    <row Id="1" PostTypeId="1" 
     AcceptedAnswerId="15" CreationDate="2010-07-19T19:12:12.510" Score="27" 
     ViewCount="1647" Body="some text;" OwnerUserId="8" 
     LastActivityDate="2010-09-15T21:08:26.077" 
     Title="title" AnswerCount="5" CommentCount="1" FavoriteCount="17" /> 
[...] 

(The dataset is a dump from stats.stackexchange.com.)

How can I get a data.frame with the attributes "Id" and "PostTypeId"?

I have been trying with the XML library, but I get to a point where I don't know how to unwrap the values:

library(XML) 

xml <- xmlTreeParse("Posts.xml", useInternalNodes=TRUE) 
types <- getNodeSet(xml, '//row/@PostTypeId') 

> types[1] 
[[1]] 
PostTypeId 
     "1" 
attr(,"class") 
[1] "XMLAttributeValue" 

What would be the proper R way of projecting these two attributes from the XML into the columns of a data.frame?

+0

When I download the file, it doesn't seem to be XML... what is the encoding? – Rentrop

+1

@Floo0 It's a [7-zip](http://www.7-zip.org/) archive. – hrbrmstr

Answers

2

Using xml2 (the package that rvest wraps), you can do it as follows. Note that XML attribute names are case-sensitive:

library(xml2) 
library(magrittr) 

doc <- read_xml('<posts> 
    <row Id="1" PostTypeId="1" 
     AcceptedAnswerId="15" CreationDate="2010-07-19T19:12:12.510" Score="27" 
     ViewCount="1647" Body="some text;" OwnerUserId="8" 
     LastActivityDate="2010-09-15T21:08:26.077" 
     Title="title" AnswerCount="5" CommentCount="1" FavoriteCount="17" /> 
</posts>') 

# select every <row> element, then read one attribute per column 
rows <- doc %>% xml_find_all("//row") 
data.frame(
    Id = rows %>% xml_attr("Id"), 
    PostTypeId = rows %>% xml_attr("PostTypeId") 
) 

which results in:

  Id PostTypeId 
1  1          1 
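For comparison, the same two columns can be built with the `XML` package used in the question. This is a minimal sketch, not the answerer's method: the two-row document and the `Score` values are made up for illustration, and `xpathSApply()` with `xmlGetAttr()` is just one of several ways to unwrap the attribute values.

```r
library(XML)

# Illustrative two-row document (made-up values); the real input would be Posts.xml.
xml_text <- '<posts>
  <row Id="1" PostTypeId="1" Score="27" />
  <row Id="2" PostTypeId="2" Score="3" />
</posts>'

# Parse the string into an internal document that XPath can query.
doc <- xmlParse(xml_text, asText = TRUE)

# For each <row>, xmlGetAttr() returns the attribute as a plain character value;
# the default keeps the column aligned if a row lacks the attribute.
posts <- data.frame(
  Id         = xpathSApply(doc, "//row", xmlGetAttr, "Id", default = NA_character_),
  PostTypeId = xpathSApply(doc, "//row", xmlGetAttr, "PostTypeId", default = NA_character_),
  stringsAsFactors = FALSE
)

print(posts)
```

Both columns come back as character vectors, so convert them with `as.integer()` if numeric ids are needed.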

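If you take Comments.xml and read its attributes in the same way:

# rows holds the <row> nodes of Comments.xml, selected as above 
# the second column keeps the name PostTypeId, but it is filled from the PostId attribute 
dat <- data.frame(
    Id = rows %>% xml_attr("Id"), 
    PostTypeId = rows %>% xml_attr("PostId"), 
    score = rows %>% xml_attr("Score") 
) 

you get:

> head(dat) 
  Id PostTypeId score 
1  1          3     5 
2  2          5     0 
3  3          9     0 
4  4          5    11 
5  5          3     1 
6  6         14     9 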
+0

Very nice. Is there any way of filtering rows? For example, only adding rows with `PostTypeId` 1 or 2? – vtortola

+0

That is what I would do... something like `dat[which(dat$PostTypeId == 2), ]`. You may want to use packages like `dplyr` or `data.table` for more data manipulation. – Rentrop
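The filtering suggested in this comment can be sketched in base R as follows. The `dat` values here are made up for illustration; the attribute values are compared as strings because they are read from the XML as character data.

```r
# Illustrative data.frame with the same columns as in the answer (made-up values).
dat <- data.frame(
  Id         = c("1", "2", "3", "4"),
  PostTypeId = c("1", "2", "5", "1"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose PostTypeId is "1" or "2".
kept <- dat[dat$PostTypeId %in% c("1", "2"), ]

print(kept)
```

Using `%in%` rather than `==` allows several values to be matched at once and never yields `NA` in the row selector.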

+0

Or you can take care of it inline, and do so memory- and time-efficiently (see my answer). – hrbrmstr

2

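This is actually a great use case for the xmlEventParse function in the XML package. This is a 200+ MB file, and the last thing you want to do is waste memory (XML parsing is notoriously memory intensive) and waste time going through the nodes multiple times.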

By using xmlEventParse you can also filter what you do or don't need, and you can put a progress bar in there so you can see what is happening.

library(XML) 

# get the # of <rows> quickly; you can approximate if you don't know the 
# number or can't run this and then chop down the size of the data.frame 
# afterwards 
system("grep -c '<row' Posts.xml") 
## 128010 

n <- 128010 

# pre-populate a data.frame 
# you could also just write this data out to a file and read it back in 
# which would negate the need to use global variables or pre-allocate 
# a data.frame 
dat <- data.frame(id=rep(NA_character_, n), 
        post_type_id=rep(NA_character_, n), 
        stringsAsFactors=FALSE) 

# set up a progress bar since there are a lot of nodes 
pb <- txtProgressBar(min=0, max=n, style=3) 

# this function will be called for each <row> 
# again, you could write to a file/database/whatever vs do this 
# data.frame population 
idx <- 1 
process_row <- function(node, tribs) { 
    # update the progress bar 
    setTxtProgressBar(pb, idx) 
    # get our data (you can filter here) 
    dat[idx, "id"] <<- tribs["Id"] 
    dat[idx, "post_type_id"] <<- tribs["PostTypeId"] 
    # update the index 
    idx <<- idx + 1 
} 

# start the parser 
info <- xmlEventParse("Posts.xml", list(row=process_row)) 

# close up the progress bar 
close(pb) 

head(dat) 
##   id post_type_id 
## 1  1            1 
## 2  2            1 
## 3  3            1 
## 4  4            1 
## 5  5            2 
## 6  6            1
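
The "you can filter here" remark in the handler can be sketched as below. This is an illustrative variant under stated assumptions, not the code above: it uses the generic `startElement` handler, keeps its state in an environment instead of global variables, and parses a small made-up file written to a temporary path. The names `state`, `on_start` and `kept` are hypothetical.

```r
library(XML)

# Illustrative input (made-up values) written to a temp file; the real file would be Posts.xml.
xml_text <- '<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Score="27" />
  <row Id="2" PostTypeId="2" Score="3" />
  <row Id="3" PostTypeId="5" Score="0" />
  <row Id="4" PostTypeId="1" Score="9" />
</posts>'
path <- tempfile(fileext = ".xml")
writeLines(xml_text, path)

# Pre-allocate for the maximum number of rows and track how many were kept.
n_max <- 4L
state <- new.env()
state$id   <- rep(NA_character_, n_max)
state$type <- rep(NA_character_, n_max)
state$n    <- 0L

# Called at the start of every element; only <row> elements with
# PostTypeId "1" or "2" are stored, everything else is skipped.
on_start <- function(name, attrs, ...) {
  if (name != "row") return(invisible(NULL))
  type <- unname(attrs["PostTypeId"])
  if (is.na(type) || !(type %in% c("1", "2"))) return(invisible(NULL))
  state$n <- state$n + 1L
  state$id[state$n]   <- unname(attrs["Id"])
  state$type[state$n] <- type
  invisible(NULL)
}

invisible(xmlEventParse(path, handlers = list(startElement = on_start)))

# Trim the pre-allocated vectors down to the rows that passed the filter.
keep_idx <- seq_len(state$n)
kept <- data.frame(id = state$id[keep_idx],
                   post_type_id = state$type[keep_idx],
                   stringsAsFactors = FALSE)
print(kept)
```

Filtering inside the handler means rejected rows are never stored, so the final data.frame only has to be trimmed, not searched again.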