2016-12-19

XML conversion in R: the last piece

I am trying to convert an XML file into a data frame.

Sample XML file:

<games id="32134"> 
    <game id="3962920" xsid="0"> 
    <time>2016-11-26T15:30:00+00:00</time> 
    <group id="33765">Roses</group> 
    <hteam id="2228">BlackSavers</hteam> 
    <ateam id="226150">Regeton</ateam> 
    <results> 
    </results> 
    <server sid="126" name="reg"> 
     <offer id="548331136"> 
      <states i="0" time="2016-11-26T10:03:56+00:00" starting_time="2016-11-26T15:30:00+00:00"> 
       <s1>2.750</s1> 
       <s2>3.600</s2> 
       <s3>2.100</s3> 
      </states> 
      <states i="1" time="2016-11-25T17:05:07+00:00" starting_time="2016-11-26T15:30:00+00:00"> 
       <s1>3.000</s1> 
       <s2>3.600</s2> 
       <s3>2.000</s3> 
      </states> 
     </offer> 
    </server> 
    <server bid="221" name="razor"> 
     <offer id="548415893"> 
      <states i="0" time="2016-11-26T10:11:26+00:00" starting_time="2016-11-26T15:30:00+00:00"> 
       <s1>653.000</s1> 
       <s2>873.600</s2> 
       <s3>225.100</s3> 
      </states> 
      <states i="1" time="2016-11-26T10:07:39+00:00" starting_time="2016-11-26T15:30:00+00:00"> 
       <s1>323.000</s1> 
       <s2>321.750</s2> 
       <s3>211.050</s3> 
      </states> 
      <states i="2" time="2016-11-25T19:54:20+00:00" starting_time="2016-11-26T15:30:00+00:00"> 
       <s1>223.100</s1> 
       <s2>322.600</s2> 
       <s3>232.050</s3> 
      </states> 
     </offer> 
    </server> 
    <server bid="291" name="nagie"> 
     <offer id="548454059"> 
      <states i="0" time="2016-11-26T13:21:08+00:00" starting_time="2016-11-26T15:30:00+00:00"> 
       <s1>323.000</s1> 
       <s2>123.400</s2> 
       <s3>342.100</s3> 
      </states> 
      <states i="1" time="2016-11-26T10:07:02+00:00" starting_time="2016-11-26T15:30:00+00:00"> 
       <s1>123.000</s1> 
       <s2>323.500</s2> 
       <s3>342.050</s3> 
      </states> 
      <states i="2" time="2016-11-25T21:35:50+00:00" starting_time="2016-11-26T15:30:00+00:00"> 
       <s1>374.000</s1> 
       <s2>349.600</s2> 
       <s3>200.000</s3> 
      </states> 
     </offer> 
    </server> 
</game> 
</games> 

Current code:

df <- do.call("rbind", xpathApply(doc, "//game", function(m) { 
data.frame(
game_id = xmlAttrs(m)["id"], 
t(xpathSApply(m, "group", function(g) { 
    c(
    group_id = xmlAttrs(g)["id"], 
    group = xmlValue(g[["group"]]) 
) 
})), 
t(xpathSApply(m, "server",function(b){ 
    sid <- xmlAttrs(b)[["sid"]] 
    name <- xmlAttrs(b)[["name"]] 
    xpathSApply(b, "offer",function(of){ 
    c(
     sid = sid, 
     name = name, 
     id = xmlAttrs(of)[["id"]], 
     do.call(cbind, xpathApply(of, "states",function(o){ 
     c(s1 <- xmlValue(o[["s1"]]), 
      s2 <- xmlValue(o[["s2"]]), 
      s3 <- xmlValue(o[["s3"]]) 
     ) 
     })) 
    )}) 

    }))) 

})) 

Desired data frame output:

Desired format

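My problem is that I cannot figure out how to get the states into the data frame as well. The other levels are already in and they work. I only need help with the last piece.

These posts helped me a lot: "xml with nested siblings to data frame in R" and "Transforming data from xml into R dataframe".

Thanks!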

A small example of the desired data frame format would be helpful. – hrbrmstr

The '' etc. and '' tags seem to be typos or misplaced; they do not close the '' tag? –

And is the 'sid' in the first 'server' really meant to be 'bid'? –

Answers

0

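You can follow approach (2) from the answer here: "Transforming data from xml into R dataframe". The idea is to search for the deepest nodes, here states, and then use xmlParent to walk up to each ancestor. From there it is routine. For example, with only a few fields (you can add the others):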

library(XML) 
doc <- xmlTreeParse("games.xml", useInternalNodes = TRUE) 

do.call("rbind", xpathApply(doc, "//states", function(states) { 
    offer <- xmlParent(states) 
    server <- xmlParent(offer) 
    game <- xmlParent(server) 
    games <- xmlParent(game) 
    data.frame(
    gamesId = xmlAttrs(games)[["id"]], 
    gameId = xmlAttrs(game)[["id"]], 
    groupid = xmlAttrs(game[["group"]])[["id"]], 
    groupname = xmlValue(game[["group"]]), 
    offerId = xmlAttrs(offer)[["id"]], 
    states_i = as.numeric(xmlAttrs(states)[["i"]]), 
    s1 = as.numeric(xmlValue(states[["s1"]])), 
    s2 = as.numeric(xmlValue(states[["s2"]])), 
    stringsAsFactors = FALSE) 
})) 

which gives:

  gamesId  gameId groupid groupname   offerId states_i     s1     s2
1   32134 3962920   33765     Roses 548331136        0   2.75   3.60
2   32134 3962920   33765     Roses 548331136        1   3.00   3.60
3   32134 3962920   33765     Roses 548415893        0 653.00 873.60
4   32134 3962920   33765     Roses 548415893        1 323.00 321.75
5   32134 3962920   33765     Roses 548415893        2 223.10 322.60
6   32134 3962920   33765     Roses 548454059        0 323.00 123.40
7   32134 3962920   33765     Roses 548454059        1 123.00 323.50
8   32134 3962920   33765     Roses 548454059        2 374.00 349.60
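
To show how the remaining fields could be added, here is a minimal self-contained sketch of the same bottom-up idea. It uses a trimmed inline copy of the sample instead of games.xml. The names xml_text and states_df are illustrative only, and falling back from "bid" to "sid" is an assumption prompted by the mixed attribute names in the sample, not something the answer specifies.

```r
library(XML)

# Trimmed inline copy of the sample: one server uses "sid", the other "bid".
xml_text <- '<games id="32134">
  <game id="3962920" xsid="0">
    <group id="33765">Roses</group>
    <server sid="126" name="reg">
      <offer id="548331136">
        <states i="0" time="2016-11-26T10:03:56+00:00">
          <s1>2.750</s1><s2>3.600</s2><s3>2.100</s3>
        </states>
        <states i="1" time="2016-11-25T17:05:07+00:00">
          <s1>3.000</s1><s2>3.600</s2><s3>2.000</s3>
        </states>
      </offer>
    </server>
    <server bid="221" name="razor">
      <offer id="548415893">
        <states i="0" time="2016-11-26T10:11:26+00:00">
          <s1>653.000</s1><s2>873.600</s2><s3>225.100</s3>
        </states>
      </offer>
    </server>
  </game>
</games>'
doc <- xmlParse(xml_text, asText = TRUE)

# One row per <states> node; ancestors are reached with xmlParent().
states_df <- do.call(rbind, xpathApply(doc, "//states", function(st) {
  offer  <- xmlParent(st)
  server <- xmlParent(offer)
  game   <- xmlParent(server)
  data.frame(
    game_id     = xmlGetAttr(game, "id"),
    group       = xmlValue(game[["group"]]),
    # Assumption: use "bid" when present, otherwise fall back to "sid".
    server_id   = xmlGetAttr(server, "bid",
                             xmlGetAttr(server, "sid", NA_character_)),
    server_name = xmlGetAttr(server, "name", NA_character_),
    offer_id    = xmlGetAttr(offer, "id"),
    states_i    = as.integer(xmlGetAttr(st, "i")),
    time        = xmlGetAttr(st, "time"),
    s1          = as.numeric(xmlValue(st[["s1"]])),
    s2          = as.numeric(xmlValue(st[["s2"]])),
    s3          = as.numeric(xmlValue(st[["s3"]])),
    stringsAsFactors = FALSE)
}))
```

The time column is left as character here; converting it to a date-time class is a separate step that depends on how the timestamps will be used.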

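This is useful, thanks. However, I have to read more than 100 XML files, and performance really matters. As you pointed out in your post, solution (1) runs faster. Would you mind adding that solution to your answer? Thanks – ponthu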

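It is definitely more work, but if you are willing to do it, it is just a matter of starting from the top and running nested `xpathApply` calls, as shown there. I would try the solution based on (2) first and assess whether its performance is adequate, since your time matters too. –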
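
The top-down variant described in this comment might look roughly like the sketch below: each level reads its own attributes once and passes them down through the closure, so no xmlParent calls are needed. This is only an illustration on a trimmed inline sample. The names xml_text, rows and df are hypothetical, and whether this version is actually faster on the real files would have to be measured.

```r
library(XML)

# Trimmed inline sample with two servers and three states in total.
xml_text <- '<games id="32134">
  <game id="3962920">
    <group id="33765">Roses</group>
    <server name="reg">
      <offer id="548331136">
        <states i="0"><s1>2.750</s1><s2>3.600</s2><s3>2.100</s3></states>
        <states i="1"><s1>3.000</s1><s2>3.600</s2><s3>2.000</s3></states>
      </offer>
    </server>
    <server name="razor">
      <offer id="548415893">
        <states i="0"><s1>653.000</s1><s2>873.600</s2><s3>225.100</s3></states>
      </offer>
    </server>
  </game>
</games>'
doc <- xmlParse(xml_text, asText = TRUE)

# Start at <game> and descend with relative XPath expressions.
rows <- xpathApply(doc, "//game", function(game) {
  game_id <- xmlGetAttr(game, "id")
  group   <- xmlValue(game[["group"]])
  do.call(rbind, xpathApply(game, "server", function(server) {
    server_name <- xmlGetAttr(server, "name")
    do.call(rbind, xpathApply(server, "offer", function(offer) {
      offer_id <- xmlGetAttr(offer, "id")
      do.call(rbind, xpathApply(offer, "states", function(st) {
        # Innermost level: one row per <states>, outer values are reused.
        data.frame(
          game_id     = game_id,
          group       = group,
          server_name = server_name,
          offer_id    = offer_id,
          states_i    = as.integer(xmlGetAttr(st, "i")),
          s1          = as.numeric(xmlValue(st[["s1"]])),
          s2          = as.numeric(xmlValue(st[["s2"]])),
          s3          = as.numeric(xmlValue(st[["s3"]])),
          stringsAsFactors = FALSE)
      }))
    }))
  }))
})
df <- do.call(rbind, rows)
```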

1

One approach is to first extract the values without worrying about the "geometry" of the file:

game_id <- as.integer(xpathSApply(doc, "//game", xmlGetAttr, "id")) 
server_id <- as.integer(xpathSApply(doc, "//server", xmlGetAttr, "bid")) 
offer_id <- as.integer(xpathSApply(doc, "//offer", xmlGetAttr, "id")) 
s1 <- as.numeric(xpathSApply(doc, "//s1", xmlValue)) 

then extract the geometry of the nested, replicated nodes (here, the number of states under each offer):

geo <- function(elt, node) length(getNodeSet(elt, node)) 
offer_geo <- sapply(getNodeSet(doc, "//offer"), geo, "states") 

and put things together by summing, or taking products of, the nested geometries:

data.frame(
    game_id = rep(game_id, sum(offer_geo)), 
    server_id = rep(server_id, offer_geo), 
    offer_id = rep(offer_id, offer_geo), 
    s1=s1)
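
As a self-contained illustration of this geometry idea, the sketch below runs on a trimmed inline sample with two offers holding two and three states. The helper count_children and the names xml_text, offers, states_per_offer and df are hypothetical. It assumes every states node has exactly one s1, s2 and s3 child; if some were missing, the value vectors would no longer line up with the repeated ids.

```r
library(XML)

# Trimmed inline sample: the first offer has 2 states, the second has 3.
xml_text <- '<games id="32134">
  <game id="3962920">
    <server name="reg">
      <offer id="548331136">
        <states i="0"><s1>2.750</s1><s2>3.600</s2><s3>2.100</s3></states>
        <states i="1"><s1>3.000</s1><s2>3.600</s2><s3>2.000</s3></states>
      </offer>
    </server>
    <server name="razor">
      <offer id="548415893">
        <states i="0"><s1>653.000</s1><s2>873.600</s2><s3>225.100</s3></states>
        <states i="1"><s1>323.000</s1><s2>321.750</s2><s3>211.050</s3></states>
        <states i="2"><s1>223.100</s1><s2>322.600</s2><s3>232.050</s3></states>
      </offer>
    </server>
  </game>
</games>'
doc <- xmlParse(xml_text, asText = TRUE)

# "Geometry": how many <states> children each <offer> has.
count_children <- function(node, path) length(getNodeSet(node, path))
offers <- getNodeSet(doc, "//offer")
states_per_offer <- sapply(offers, count_children, "states")

# Per-offer values, read once per offer (the server is the offer's parent).
offer_id    <- sapply(offers, xmlGetAttr, "id")
server_name <- sapply(offers, function(o) xmlGetAttr(xmlParent(o), "name"))

# Repeat each per-offer value once per state; leaf values are already flat
# and come back in document order.
df <- data.frame(
  server_name = rep(server_name, states_per_offer),
  offer_id    = rep(offer_id, states_per_offer),
  states_i    = as.integer(xpathSApply(doc, "//states", xmlGetAttr, "i")),
  s1          = as.numeric(xpathSApply(doc, "//states/s1", xmlValue)),
  s2          = as.numeric(xpathSApply(doc, "//states/s2", xmlValue)),
  s3          = as.numeric(xpathSApply(doc, "//states/s3", xmlValue)),
  stringsAsFactors = FALSE)
```

Taking the server name from each offer's parent avoids assuming one offer per server, which a separate per-server vector repeated by the per-offer counts would require.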