2016-01-05 30 views
1

Importing XML data into R with missing values

I am currently trying to import data from an XML file into R.

The XML file has multiple records, and I want each record in a single row of a data frame. Sample record:

<rec resultID="5"> 
    <header shortDbName="psyh" longDbName="PsycINFO" uiTerm="2015-99210-426"> 
    <controlInfo> 
     <bkinfo> 
     <btl>The impact of zoo live animal presentations on students' propensity to engage in conservation behaviors.</btl> 
     <aug /> 
     <isbn>9781321491562</isbn> 
     </bkinfo> 
     <chapinfo /> 
     <revinfo /> 
     <dissinfo> 
     <disstl>The impact of zoo live animal presentations on students' propensity to engage in conservation behaviors.</disstl> 
     </dissinfo> 
     <jinfo> 
     <jtl>Dissertation Abstracts International Section A: Humanities and Social Sciences</jtl> 
     <issn type="Print">04194209</issn> 
     </jinfo> 
     <pubinfo> 
     <dt year="2015" month="01" day="01">20150101</dt> 
     <vid>76</vid> 
     <iid>5-A(E)</iid> 
     </pubinfo> 
     <artinfo> 
     <ui type="umi">AAI3671924</ui> 
     <tig> 
      <atl>The impact of zoo live animal presentations on students' propensity to engage in conservation behaviors.</atl> 
     </tig> 
     <aug> 
      <au>Kirchgessner, Mandy L.</au> 
     </aug> 
     <sug> 
      <subj type="major">Animals</subj> 
      <subj type="major">Hope</subj> 
      <subj type="minor">Conservation (Ecological Behavior)</subj> 
      <subj type="minor">Outreach Programs</subj> 
      <subj type="minor">Psychological Development</subj> 
     </sug> 
     <ab>Zoos frequently deploy outreach programs, often called "Zoomobiles," to schools; these programs incorporate zoo resources, such as natural artifacts and live animals, in order to teach standardized content and in hopes of inspiring students to protect the environment. Educational research at zoos is relatively rare, and research on their outreach programs is non-existent. This leaves zoos vulnerable to criticisms as they have little to no evidence that their strategies support their missions, which target conservation outcomes. This study seeks to shed light on this gap by analyzing the impact that live animals have on offsite program participants' interests in animals and subsequent conservation outcomes. The theoretical lens is derived from the field of Conservation Psychology, which believes personal connections with nature serve as the motivational component to engagement with conservation efforts. Using pre, post, and delayed surveys combined with Zoomobile presentation observations, I analyzed the roles of sensory experiences in students' (N=197) development of animal interest and conservation behaviors. Results suggest that touching even one animal during presentations has a significant impact on conservation intents and sustainment of those intents. Although results on interest outcomes are conflicting, this study points to ways this kind of research can make significant contributions to zoo learning outcomes. Other significant variables, such as emotional predispositions and animal-related excitement, are discussed in light of future research directions. (PsycINFO Database Record (c) 2015 APA, all rights reserved)</ab> 
     <pubtype>Dissertation Abstract</pubtype> 
     <doctype>Dissertation</doctype> 
     </artinfo> 
     <language>English</language> 
    </controlInfo> 
    <displayInfo> 
     <pLink> 
     <url>http://search.ebscohost.com/login.aspx?direct=true&amp;db=psyh&amp;AN=2015-99210-426&amp;site=ehost-live&amp;scope=site</url> 
     </pLink> 
    </displayInfo> 
    </header> 
</rec> 

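I tried the following approach, but it gets slow on large data sets. Also, when a node is missing data, I would like the function to return NA for that row/record, but I don't think that can be done with this function?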

title <- xmlToDataFrame(nodes = getNodeSet(xmltop, "//atl"), stringsAsFactors = FALSE) 
author <- xmlToDataFrame(nodes = getNodeSet(xmltop, "//artinfo/aug/au[1]"), stringsAsFactors = FALSE) 
abstract <- xmlToDataFrame(nodes = getNodeSet(xmltop, "//artinfo/ab[1]"), stringsAsFactors = FALSE) 
year <- xmlToDataFrame(nodes = getNodeSet(xmltop, "//pubinfo/dt"), stringsAsFactors = FALSE) 

I tried to follow the instructions in "R dataframe from XML when values are multiple or missing", without success:

doc = xmlParse(file.choose(), useInternalNodes = TRUE) 

do.call(rbind, xpathApply(xmltop, "/rec", function(node) { 
    auth <- xmlValue(node[["artinfo/aug/au[1]"]]) 
    if (is.null(auth)) auth <- NA 
    year <- xmlValue(node[["//pubinfo/dt"]]) 
    if (is.null(year)) year <- NA 
    title <- xmlValue(node[["//atl"]]) 
    if (is.null(title)) title <- NA 
    abstract <- xmlValue(node[["//artinfo/ab[1]"]]) 
    if (is.null(abstract)) abstract <- NA 

    data.frame(auth, year, title, abstract, stringsAsFactors = FALSE) 

})) 

I'm still not very well acquainted with XPath and R, but I suspect there is some problem with the "node" bit above?

+0

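Do you have a general-purpose language (C#, Java, Perl, PHP, Python, or even the VBA that ships with MS Excel/Access) installed alongside R? These languages can run XSLT, which can redesign the XML into a simpler format that R can import with `xmlToDataFrame()`. – Parfait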

+0

`xmlToDataFrame` works (I used it above). I have VBA/Python. I tried importing with Excel, but that uses multiple rows for each /rec node, whereas I want only one row per node. – user3084100

Answers

1

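As mentioned, consider running XSLT to simplify your XML into a single child level of rows and columns, which can then easily be imported into R with xmlToDataFrame():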

<row> 
    <column>data</column> 
    <column>data</column> 
    <column>data</column> 
</row> 
<row> 
    <column>data</column> 
    <column>data</column> 
    <column>data</column> 
</row> 

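R does not yet have a universal XSLT 1.0 processor. Fortunately, most general-purpose languages, including C#, Java, Python, PHP, Perl and VB, can run XSLT scripts to reformat/redesign complex XML data. Below are Python and VBA scripts, followed by the final R import line.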

XSLT script (save as a .xsl or .xslt file)

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"> 
<xsl:output version="1.0" encoding="UTF-8" indent="yes" /> 
<xsl:strip-space elements="*"/> 

    <!-- Wraps All Output in a Single Root Element --> 
    <xsl:template match="/"> 
    <data> 
     <xsl:apply-templates /> 
    </data> 
    </xsl:template> 

    <!-- Walks the Tree Without Copying Anything by Default --> 
    <xsl:template match="@*|node()">  
     <xsl:apply-templates select="@*|node()"/>  
    </xsl:template> 

    <!-- Extracts Needed Elements --> 
    <xsl:template match="controlInfo"> 
    <row> 
     <title><xsl:value-of select="artinfo/tig/atl"/></title> 
     <author><xsl:value-of select="artinfo/aug/au"/></author> 
     <abstract><xsl:value-of select="artinfo/ab"/></abstract> 
     <year><xsl:value-of select="pubinfo/dt"/></year> 
    </row> 
    </xsl:template> 

<!-- Removes Element (empty template) --> 
<xsl:template match="displayInfo"/> 

</xsl:transform> 

Python script (using the lxml module)

import lxml.etree as ET 

# LOAD XML AND XSL FILES 
dom = ET.parse('Input.xml') 
xslt = ET.parse('XSLTScript.xsl') 

# TRANSFORMS INPUT 
transform = ET.XSLT(xslt) 
newdom = transform(dom) 

# OUTPUTS FILE 
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True) 
print(tree_out.decode("utf-8")) 

# The with-block closes the file even if the write fails 
with open('Output.xml', 'wb') as xmlfile: 
    xmlfile.write(tree_out) 

VBA macro (using MSXML objects)

Sub TransformXML() 
    Dim wb As Workbook 
    Dim xmlDoc As Object, xslDoc As Object, newDoc As Object 
    Dim strPath As String, xslFile As String 
    Dim i As Long 

    ' INITIALIZE MSXML OBJECTS ' 
    Set xmlDoc = CreateObject("MSXML2.DOMDocument") 
    Set xslDoc = CreateObject("MSXML2.DOMDocument") 
    Set newDoc = CreateObject("MSXML2.DOMDocument") 

    ' LOAD XML AND XSL ' 
    xmlDoc.async = False 
    xmlDoc.Load "C:\Path\To\Input.xml" 

    xslDoc.async = False 
    xslDoc.Load "C:\Path\To\XSLTScript.xsl" 

    ' TRANSFORM XML ' 
    xmlDoc.transformNodeToObject xslDoc, newDoc 

    ' OUTPUT XML ' 
    newDoc.Save "C:\Path\To\Output.xml" 

    Set xmlDoc = Nothing 
    Set xslDoc = Nothing 
    Set newDoc = Nothing 

End Sub 

Output of the XML transformation

<?xml version='1.0' encoding='UTF-8'?> 
    <data> 
     <row> 
     <title>The impact of zoo live animal presentations on students' 
       propensity to engage in conservation behaviors.</title> 
     <author>Kirchgessner, Mandy L.</author> 
     <abstract>Zoos frequently deploy outreach programs, often called 
        "Zoomobiles," to schools; these programs incorporate zoo resources, such as 
        natural artifacts and live animals, in order to teach standardized content 
        and in hopes of inspiring students to protect the environment. Educational 
        research at zoos is relatively rare, and research on their outreach programs 
        is non-existent. This leaves zoos vulnerable to criticisms as they have 
        little to no evidence that their strategies support their missions, which 
        target conservation outcomes. This study seeks to shed light on this gap by 
        analyzing the impact that live animals have on offsite program participants' 
        interests in animals and subsequent conservation outcomes. The theoretical 
        lens is derived from the field of Conservation Psychology, which believes 
        personal connections with nature serve as the motivational component to 
        engagement with conservation efforts. Using pre, post, and delayed surveys 
        combined with Zoomobile presentation observations, I analyzed the roles of 
        sensory experiences in students' (N=197) development of animal interest and 
        conservation behaviors. Results suggest that touching even one animal during 
        presentations has a significant impact on conservation intents and 
        sustainment of those intents. Although results on interest outcomes are 
        conflicting, this study points to ways this kind of research can make 
        significant contributions to zoo learning outcomes. Other significant 
        variables, such as emotional predispositions and animal-related excitement, 
        are discussed in light of future research directions. (PsycINFO Database 
        Record (c) 2015 APA, all rights reserved)</abstract> 
     <year>20150101</year> 
     </row> 
    </data> 

R script (using the XML package)

library(XML) 
doc <- xmlToDataFrame("Output.xml")    # MISSING NODES RENDER AS EMPTY
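
If running XSLT is not convenient, the same one-row-per-record flattening can be sketched directly in Python with the standard library. This is only an illustrative alternative, not the XSLT approach described here. The helper name `flatten_records`, the `FIELDS` mapping, the `<records>` wrapper element and the two-record sample are all assumptions made for the example; the element paths mirror the ones the XSLT extracts. A missing node becomes `None`, which the `csv` module writes as an empty cell.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Column name -> path relative to each <rec> (same elements the XSLT extracts).
FIELDS = {
    "title": "header/controlInfo/artinfo/tig/atl",
    "author": "header/controlInfo/artinfo/aug/au",
    "abstract": "header/controlInfo/artinfo/ab",
    "year": "header/controlInfo/pubinfo/dt",
}


def flatten_records(root):
    """Return one dict per <rec>; a field whose node is missing is None."""
    rows = []
    for rec in root.iter("rec"):
        # findtext returns the text of the first match, or the default if nothing matches.
        rows.append({name: rec.findtext(path, default=None)
                     for name, path in FIELDS.items()})
    return rows


# Hypothetical sample: the second record has no author and no abstract.
SAMPLE = """<records>
  <rec resultID="1"><header><controlInfo>
    <pubinfo><dt year="2015" month="01" day="01">20150101</dt></pubinfo>
    <artinfo>
      <tig><atl>First title</atl></tig>
      <aug><au>Kirchgessner, Mandy L.</au></aug>
      <ab>Some abstract.</ab>
    </artinfo>
  </controlInfo></header></rec>
  <rec resultID="2"><header><controlInfo>
    <pubinfo><dt year="2014" month="06" day="01">20140601</dt></pubinfo>
    <artinfo><tig><atl>Second title</atl></tig></artinfo>
  </controlInfo></header></rec>
</records>"""

rows = flatten_records(ET.fromstring(SAMPLE))

# Write the rows as CSV; None values are written as empty cells.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(FIELDS))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

If the CSV text is saved to a file (say, a hypothetical records.csv), reading it in R with `read.csv("records.csv", na.strings = "", stringsAsFactors = FALSE)` should turn the empty cells into NA.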