2012-11-27 43 views
8

How to create an R data frame from an XML file

I have an XML document file. Part of the file looks like this:

<attr>
    <attrlabl>COUNTY</attrlabl>
    <attrdef>County abbreviation</attrdef>
    <attrtype>Text</attrtype>
    <attwidth>1</attwidth>
    <atnumdec>0</atnumdec>
    <attrdomv>
     <edom>
      <edomv>C</edomv>
      <edomvd>Clackamas County</edomvd>
      <edomvds/>
     </edom>
     <edom>
      <edomv>M</edomv>
      <edomvd>Multnomah County</edomvd>
      <edomvds/>
     </edom>
     <edom>
      <edomv>W</edomv>
      <edomvd>Washington County</edomvd>
      <edomvds/>
     </edom>
    </attrdomv>
</attr>

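From this XML file, I want to create an R data frame with the columns attrlabl, attrdef, attrtype, and attrdomv. Note that the attrdomv column should contain all the levels of the categorical variable. The data frame should look like this: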

attrlabl  attrdef              attrtype  attrdomv
COUNTY    County abbreviation  Text      C Clackamas County; M Multnomah County; W Washington County

I have some incomplete code like this:

library(XML)
doc <- xmlParse("taxlots.shp.xml")
dataDictionary <- xmlToDataFrame(getNodeSet(doc,"//attrlabl"))

Could you please complete my R code? I appreciate any help!

+1

Can you provide a valid XML file? – agstudy

+0

@agstudy: Could you tell me how to send you my XML file? – POTENZA

+0

You can't do that here, but you can use a file upload service such as SkyDrive and post the link. – agstudy

Answers

9

Assuming this is the correct taxlots.shp.xml file:

<attr> 
    <attrlabl>COUNTY</attrlabl> 
    <attrdef>County abbreviation</attrdef> 
    <attrtype>Text</attrtype> 
    <attwidth>1</attwidth> 
    <atnumdec>0</atnumdec> 
    <attrdomv> 
     <edom> 
      <edomv>C</edomv> 
      <edomvd>Clackamas County</edomvd> 
      <edomvds/> 
     </edom> 
     <edom> 
      <edomv>M</edomv> 
      <edomvd>Multnomah County</edomvd> 
      <edomvds/> 
     </edom> 
     <edom> 
      <edomv>W</edomv> 
      <edomvd>Washington County</edomvd> 
      <edomvds/> 
     </edom> 
    </attrdomv> 
</attr> 

You're almost there:

library(XML)
doc <- xmlParse("taxlots.shp.xml")
xmlToDataFrame(nodes=getNodeSet(doc,"//attr"))[c("attrlabl","attrdef","attrtype","attrdomv")]
  attrlabl             attrdef attrtype                                             attrdomv
1   COUNTY County abbreviation     Text CClackamas CountyMMultnomah CountyWWashington County

But the last field is not in the format you want. Getting there takes a few extra steps:

step1 <- xmlToDataFrame(nodes=getNodeSet(doc,"//attrdomv/edom"))
step1
  edomv            edomvd edomvds
1     C  Clackamas County
2     M  Multnomah County
3     W Washington County

step2 <- paste(paste(step1$edomv, step1$edomvd, sep=" "), collapse="; ") 
step2 
[1] "C Clackamas County; M Multnomah County; W Washington County" 

cbind(xmlToDataFrame(nodes=getNodeSet(doc, "//attr"))[c("attrlabl", "attrdef", "attrtype")],
      attrdomv=step2)
  attrlabl             attrdef attrtype                                                    attrdomv
1   COUNTY County abbreviation     Text C Clackamas County; M Multnomah County; W Washington County
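
One caveat: the steps above collapse every edom in the document into a single string, which is fine for the one-attr sample but would mix levels together if the real file holds several attr elements. The following is only a sketch of a per-attr variant, not part of the original answer. The wrapper element attrs, the second attribute TLID, and the helper names attr_to_row and child_text are hypothetical, and the XML is parsed from an inline string instead of taxlots.shp.xml so that the example is self-contained.

```r
library(XML)

# Hypothetical sample: the COUNTY attribute from the question plus an
# invented second attribute (TLID) that has no attrdomv block.
xml_text <- '<attrs>
  <attr>
    <attrlabl>COUNTY</attrlabl>
    <attrdef>County abbreviation</attrdef>
    <attrtype>Text</attrtype>
    <attrdomv>
      <edom><edomv>C</edomv><edomvd>Clackamas County</edomvd><edomvds/></edom>
      <edom><edomv>M</edomv><edomvd>Multnomah County</edomvd><edomvds/></edom>
      <edom><edomv>W</edomv><edomvd>Washington County</edomvd><edomvds/></edom>
    </attrdomv>
  </attr>
  <attr>
    <attrlabl>TLID</attrlabl>
    <attrdef>Tax lot identifier</attrdef>
    <attrtype>Text</attrtype>
  </attr>
</attrs>'
doc <- xmlParse(xml_text, asText = TRUE)

# Build a one-row data frame from a single <attr> node.
attr_to_row <- function(node) {
  # Text of the first direct child with this name, or NA if it is absent.
  child_text <- function(name) {
    val <- xpathSApply(node, paste0("./", name), xmlValue)
    if (length(val) == 0) NA_character_ else val[[1]]
  }
  # Codes and descriptions of the levels that belong to this <attr> only.
  codes <- xpathSApply(node, "./attrdomv/edom/edomv", xmlValue)
  descs <- xpathSApply(node, "./attrdomv/edom/edomvd", xmlValue)
  data.frame(
    attrlabl = child_text("attrlabl"),
    attrdef  = child_text("attrdef"),
    attrtype = child_text("attrtype"),
    # Empty string when the attribute has no enumerated levels.
    attrdomv = paste(paste(codes, descs), collapse = "; "),
    stringsAsFactors = FALSE
  )
}

# One row per <attr>, stacked into a single data frame.
dataDictionary <- do.call(rbind, lapply(getNodeSet(doc, "//attr"), attr_to_row))
```

The relative XPath expressions (starting with ./) keep each lookup inside the current attr node, so the levels of one attribute do not leak into another row.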
+0

Upvote because it's neat, and shorter than xpathSApply! – agstudy