From XML attributes to a data.frame in R

I have an XML file containing data like this:

<?xml version="1.0" encoding="utf-8"?> 
<posts> 
    <row Id="1" PostTypeId="1" 
     AcceptedAnswerId="15" CreationDate="2010-07-19T19:12:12.510" Score="27" 
     ViewCount="1647" Body="some text;" OwnerUserId="8" 
     LastActivityDate="2010-09-15T21:08:26.077" 
     Title="title" AnswerCount="5" CommentCount="1" FavoriteCount="17" /> 
[...]

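(The dataset is a dump from stats.stackexchange.com.)

How can I get a data.frame with the attributes "Id" and "PostTypeId"?

I have been trying with the XML library, but I reach a point where I don't know how to unwrap the values: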

library(XML) 

xml <- xmlTreeParse("Posts.xml", useInternalNodes = TRUE) 
types <- getNodeSet(xml, '//row/@PostTypeId') 

> types[1] 
[[1]] 
PostTypeId 
     "1" 
attr(,"class") 
[1] "XMLAttributeValue"

What would be the proper R way to project these attributes from the XML into a two-column data.frame?

Source

2015-10-01 vtortola

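When I download the file, it doesn't seem to be XML ... what is the encoding? – Rentrop

@Floo0 It's a [7-zip](http://www.7-zip.org/) archive. – hrbrmstr

Using rvest (which is more or less a wrapper around xml2), you can do this as follows: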

# rvest wraps xml2; the calls below are xml2's own functions
library(xml2) 
library(magrittr)  # for %>% 
doc <- read_xml('<posts> 
    <row Id="1" PostTypeId="1" 
AcceptedAnswerId="15" CreationDate="2010-07-19T19:12:12.510" Score="27" 
ViewCount="1647" Body="some text;" OwnerUserId="8" 
LastActivityDate="2010-09-15T21:08:26.077" 
Title="title" AnswerCount="5" CommentCount="1" FavoriteCount="17" /> 
</posts>') 

# select every <row> element, then pull out the attributes by name 
# (XML attribute names are case-sensitive) 
rows <- doc %>% xml_find_all("//row") 
data.frame(
    Id = rows %>% xml_attr("Id"), 
    PostTypeId = rows %>% xml_attr("PostTypeId") 
)

which results in:

  Id PostTypeId
1  1          1

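If you apply the same approach to Comments.xml with

# `rows` here holds the <row> nodes of Comments.xml 
dat <- data.frame(
    Id = rows %>% xml_attr("Id"), 
    PostTypeId = rows %>% xml_attr("PostId"),  # filled from the PostId attribute 
    score = rows %>% xml_attr("Score") 
)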

you get:

> head(dat) 
  Id PostTypeId score
1  1          3     5
2  2          5     0
3  3          9     0
4  4          5    11
5  5          3     1
6  6         14     9
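
When more than two or three attributes are needed, repeating `xml_attr()` by hand gets tedious. The following is a minimal sketch of one way to loop over attribute names with xml2; the sample rows, the `wanted` vector and the integer conversion are assumptions made for illustration, not part of the original answer.

```r
library(xml2)

# Made-up sample standing in for a Stack Exchange dump file.
sample_xml <- '<posts>
  <row Id="1" PostTypeId="1" Score="27" />
  <row Id="2" PostTypeId="2" Score="3" />
  <row Id="3" PostTypeId="1" />
</posts>'

rows <- xml_find_all(read_xml(sample_xml), "//row")

# Attribute names to extract (hypothetical selection).
wanted <- c("Id", "PostTypeId", "Score")

# xml_attr() returns one value per node, NA where the attribute is missing.
cols <- lapply(wanted, function(a) xml_attr(rows, a))
names(cols) <- wanted
dat <- as.data.frame(cols, stringsAsFactors = FALSE)

# Attribute values are character strings; convert where a number is wanted.
dat$Score <- as.integer(dat$Score)
print(dat)
```

Rows that lack an attribute simply get `NA` in that column, so rows with differing attribute sets still line up.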

Source

2015-10-01 21:18:27 Rentrop

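Very nice. Is there any way to filter rows? For example, only add rows with `PostTypeId` 1 or 2? – vtortola

That is what I would do afterwards ... something like `dat[which(dat$PostTypeId == 2), ]`. You might want to use packages like `dplyr` or `data.table` for more data manipulation. – Rentrop

Or you can take care of it inline and be memory/time efficient (see my answer). – hrbrmstr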
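
The after-the-fact filtering suggested in the comments can be sketched in base R as follows. The small `dat` below is made-up stand-in data; the one detail worth noting is that attribute values arrive as character strings, so the comparison is done against `"1"` and `"2"`.

```r
# Made-up stand-in for the data.frame built from the XML attributes.
dat <- data.frame(
  Id = c("1", "2", "3", "4"),
  PostTypeId = c("1", "2", "5", "2"),
  stringsAsFactors = FALSE
)

# Keep only rows whose PostTypeId is 1 or 2.
# %in% never returns NA, so no which() wrapper is needed.
kept <- dat[dat$PostTypeId %in% c("1", "2"), ]
print(kept)
```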

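This is actually a great use case for the xmlEventParse function in the XML package. This is a 200+ MB file, and the last thing you want to do is waste memory (XML parsing is notoriously memory intensive) and waste time going through the nodes multiple times.

By using xmlEventParse you can also filter what you do or don't need, and you can put a progress bar in there so you can see what's going on.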

library(XML) 
library(data.table) 

# get the # of <rows> quickly; you can approximate if you don't know the 
# number or can't run this and then chop down the size of the data.frame 
# afterwards 
system("grep -c '<row' ~/Desktop/p1.xml") 
## 128010 

n <- 128010 

# pre-populate a data.frame 
# you could also just write this data out to a file and read it back in 
# which would negate the need to use global variables or pre-allocate 
# a data.frame 
dat <- data.frame(id=rep(NA_character_, n), 
        post_type_id=rep(NA_character_, n), 
        stringsAsFactors=FALSE) 

# set up a progress bar since there are a lot of nodes 
pb <- txtProgressBar(min=0, max=n, style=3) 

# this function will be called for each <row> 
# again, you could write to a file/database/whatever vs do this 
# data.frame population 
idx <- 1 
process_row <- function(node, tribs) { 
    # update the progress bar 
    setTxtProgressBar(pb, idx) 
    # get our data (you can filter here) 
    dat[idx, "id"] <<- tribs["Id"] 
    dat[idx, "post_type_id"] <<- tribs["PostTypeId"] 
    # update the index 
    idx <<- idx + 1 
} 

# start the parser 
info <- xmlEventParse("Posts.xml", list(row=process_row)) 

# close up the progress bar 
close(pb) 

head(dat) 
##   id post_type_id
## 1  1            1
## 2  2            1
## 3  3            1
## 4  4            1
## 5  5            2
## 6  6            1
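
The "you can filter here" remark in the handler could look roughly like the sketch below. It is a variation rather than the code above: it uses the generic `startElement` handler instead of a per-tag handler, keeps its state in a closure instead of global variables, and parses a made-up inline sample via `asText = TRUE`. `collect_rows()` and its `keep` argument are hypothetical names introduced for illustration.

```r
library(XML)

# Made-up sample standing in for Posts.xml.
sample_xml <- '<posts>
  <row Id="1" PostTypeId="1" Score="27" />
  <row Id="2" PostTypeId="2" Score="3" />
  <row Id="3" PostTypeId="5" Score="0" />
</posts>'

# Hypothetical helper: stream through the XML and keep only <row> elements
# whose PostTypeId is listed in `keep`.
collect_rows <- function(xml_text, keep = c("1", "2")) {
  ids <- character(0)
  types <- character(0)

  # Called once per opening tag with the tag name and its attributes.
  handler <- function(name, attrs, ...) {
    if (name == "row" && attrs["PostTypeId"] %in% keep) {
      ids <<- c(ids, attrs[["Id"]])
      types <<- c(types, attrs[["PostTypeId"]])
    }
  }

  xmlEventParse(xml_text, handlers = list(startElement = handler),
                asText = TRUE)
  data.frame(id = ids, post_type_id = types, stringsAsFactors = FALSE)
}

filtered <- collect_rows(sample_xml)
print(filtered)
```

Growing the vectors with `c()` keeps the sketch short; for a file with over a hundred thousand rows, pre-allocating as in the answer above (or writing to a file) is the better fit.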

Source

2015-10-01 21:55:08 hrbrmstr
