2016-01-24 36 views

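Parsing XML in R - returning a data frame object

I have successfully read the Example 1 XML into R as a data frame object, but I am having trouble with Example 2. Does anyone have a suggestion for R code that converts the data from mtcars.xml into a data frame?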

Example 1)

library(XML) 
# Save the URL of the xml file in a variable 

xml.url <- "http://www.w3schools.com/xml/plant_catalog.xml" 

# Use the xmlTreeParse function to parse the XML file directly from the web 

xmlfile <- xmlTreeParse(xml.url) 

# Use the xmlRoot-function to access the top node 
xmltop = xmlRoot(xmlfile) 
# have a look at the XML-code of the first subnodes: 
print(xmltop[1:2]) 


# To extract the XML-values from the document, use xmlSApply: 

plantcat <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue)) 

Example 2)

library(XML) 
# Save the URL of the xml file in a variable 

doc <- xmlTreeParse(system.file("exampleData", "mtcars.xml", package="XML")) 


xmlfile <- xmlTreeParse(doc) 

# Use the xmlRoot-function to access the top node 
xmltop = xmlRoot(xmlfile) 
# have a look at the XML-code of the first subnodes: 
print(xmltop)[1:2] 


# To extract the XML-values from the document, use xmlSApply: 

mtcarscat <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue)) 

For the first one, `xmlToDataFrame('http://www.w3schools.com/xml/plant_catalog.xml')` does it in one go. – alistaire

Answers

1

Try xpathSApply:

library(XML) 

path <- system.file("exampleData", "mtcars.xml", package="XML") 
doc <- xmlTreeParse(path, useInternalNodes = TRUE) 
root <- xmlRoot(doc) 

read.table(text = xpathSApply(root, "//record", xmlValue), 
      col.names = xpathSApply(root, "//variable", xmlValue)) 

which gives:

mpg cyl disp hp drat wt qsec vs am gear carb 
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 
... etc ... 
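
To make the mechanics of this approach easier to follow, here is a minimal self-contained sketch on a small inline document. The XML string is hypothetical and only imitates what I assume the layout of mtcars.xml to be (a list of `variable` names followed by `record` elements holding whitespace-separated values, each with an `id` attribute); the names `xml_string`, `vars`, `rows` and `df` are mine. It also parses the document only once, whereas Example 2 in the question passes an already-parsed object to `xmlTreeParse` a second time.

```r
library(XML)

# Hypothetical inline document imitating the assumed layout of mtcars.xml
xml_string <- '<dataset>
  <variables>
    <variable>mpg</variable>
    <variable>cyl</variable>
  </variables>
  <record id="Mazda RX4">21.0 6</record>
  <record id="Datsun 710">22.8 4</record>
</dataset>'

# Parse once, from text, into an internal document that XPath can query
doc <- xmlParse(xml_string, asText = TRUE)

# Column names come from the <variable> elements
vars <- xpathSApply(doc, "//variable", xmlValue)

# Each <record> is one line of whitespace-separated values
rows <- xpathSApply(doc, "//record", xmlValue)

# read.table splits each line into columns and guesses the column types
df <- read.table(text = rows, col.names = vars)

# Optionally keep the record ids as row names (assumes every record has an id)
rownames(df) <- xpathSApply(doc, "//record", xmlGetAttr, "id")

print(df)
```

Letting `read.table` do the splitting means the numeric columns arrive as numbers rather than strings, which is probably what you want for data like mtcars.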
1

Here is an approach using xml2:

library(xml2) 
library(purrr) 
library(dplyr) 

catalog_url <- "http://www.w3schools.com/xml/plant_catalog.xml" 
doc <- read_xml(catalog_url) 

# get all the "records" 
plants <- xml_find_all(doc, ".//PLANT") 

# get all the field names 
kids <- xml_name(xml_children(plants[1])) 

# make a data frame 
# - iterate over each record 
# - in each record grab each field 
# - turn each row into a data frame 
# - bind all the data frames together 

# Note: bind_rows() and xml_find_first() replace the older rbind_list() and xml_find_one()
map_df(plants, function(plant) { 
    bind_rows(as.list(setNames(map_chr(kids, function(kid) { 
        xml_text(xml_find_first(plant, sprintf(".//%s", kid))) 
    }), kids))) 
}) 

## Source: local data frame [36 x 6] 
## 
##     COMMON    BOTANICAL ZONE  LIGHT PRICE AVAILABILITY 
##     (chr)     (chr) (chr)  (chr) (chr)  (chr) 
## 1   Bloodroot Sanguinaria canadensis  4 Mostly Shady $2.44  031599 
## 2   Columbine Aquilegia canadensis  3 Mostly Shady $9.37  030699 
## 3  Marsh Marigold  Caltha palustris  4 Mostly Sunny $6.81  051799 
## 4    Cowslip  Caltha palustris  4 Mostly Shady $9.90  030699 
## 5 Dutchman's-Breeches Dicentra cucullaria  3 Mostly Shady $6.44  012099 
## 6   Ginger, Wild  Asarum canadense  3 Mostly Shady $9.03  041899 
## 7    Hepatica  Hepatica americana  4 Mostly Shady $4.45  012699 
## 8   Liverleaf  Hepatica americana  4 Mostly Shady $3.99  010299 
## 9 Jack-In-The-Pulpit Arisaema triphyllum  4 Mostly Shady $3.23  020199 
## 10   Mayapple Podophyllum peltatum  3 Mostly Shady $2.98  060599 
## ..     ...     ... ...   ... ...   ... 

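It could be made a bit more robust by collecting the names of all possible children (some "records" may have more or fewer children), but it is good enough for this example. Doing it this way (getting the value of each element by name) ensures the values come back in the correct order (the order of the elements is not guaranteed).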
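
A rough sketch of that more robust variant, under some assumptions: the inline XML is a made-up two-record catalog (not the real plant_catalog.xml) in which the records differ in both the number and the order of their children, and the names `xml_string`, `rows` and `df` are mine. It uses base R instead of purrr and dplyr to stay self-contained.

```r
library(xml2)

# Hypothetical catalog: the second record has an extra child and a different order
xml_string <- '<CATALOG>
  <PLANT><COMMON>Bloodroot</COMMON><ZONE>4</ZONE></PLANT>
  <PLANT><ZONE>3</ZONE><COMMON>Columbine</COMMON><PRICE>$9.37</PRICE></PLANT>
</CATALOG>'

doc <- read_xml(xml_string)
plants <- xml_find_all(doc, ".//PLANT")

# Union of the child element names across all records, not just the first one
kids <- unique(xml_name(xml_children(plants)))

# One single-row data frame per record; a missing child becomes NA
rows <- lapply(plants, function(plant) {
  vals <- vapply(kids, function(kid) {
    xml_text(xml_find_first(plant, paste0("./", kid)))
  }, character(1))
  as.data.frame(as.list(vals), stringsAsFactors = FALSE)
})

# Stack the rows; every row has the same columns in the same order
df <- do.call(rbind, rows)
print(df)
```

Because each value is looked up by name, the columns line up even though the two records list their children in a different order, and the record without a `PRICE` simply gets `NA` there.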