How do I modify the top-level XML node in R?

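I want to add an attribute to the top-most node of an XML file and then save the file. I have tried every combination of XPath and subsetting I can think of, but I can't get it to work. Using a simple example: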

library(XML)

xml_string = c(
'<?xml version="1.0" encoding="UTF-8"?>', 
'<retrieval-response status = "found">', 
     '<coredata>', 
      '<id type = "author" >12345</id>', 
     '</coredata>', 
     '<author>', 
      '<first>John</first>', 
      '<last>Doe</last>', 
     '</author>', 
'</retrieval-response>') 

# parse xml content 
xml = xmlParse(xml_string)

When I try

xmlAttrs(xml["/retrieval-response"][[1]]) <- c(id = 12345)

I get an error:

object of type 'externalptr' is not subsettable

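However, the attribute does get inserted, so I'm not sure what I'm doing wrong. (More background: this is a simplified version of data from the Scopus API. I am combining thousands of similarly structured XML files, but the id sits in the "coredata" node, which is a sibling of the "author" node that holds all the data. As a result, when I use SAS to compile the combined XML document into datasets, there is no link between the id and the data. I'm hoping that adding the id to the top of the hierarchy will cause it to propagate down to all the other levels.)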

Source

2015-11-10 Sarah Hailey

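This can be done easily with [XSLT](http://www.w3schools.com/xsl/), a language that restructures XML documents to suit any nuanced need. If [SAS](https://www.sas.com/en_us/home.html) refers to the statistical software package, then we can use [proc xsl](http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a003356144.htm). Please tag this question with SAS and provide an actual sample of the XML document and the desired dataset result. – Parfait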

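[Here](https://dl.dropboxusercontent.com/u/8428744/example_file.xml) is an example file. I have over 11,000 of these files, which I merge into one big XML file using a program called mergex.exe. I then use SAS's XML Mapper to import the XML file into SAS. This is very convenient, but the structure of the XML files makes it impossible to link the id to the author information. Ideally, every dataset generated in SAS would contain the author ID (which I extract from the XML file using 'as.numeric(sub("AUTHOR_ID:", "", xmlValue(xml["//dc:identifier"][[1]])))'). –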

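Edit: After trying the approach of editing the top node (see Old Answer), I realized that editing the top-level node doesn't solve my problem, because the SAS XML Mapper doesn't retain all of the IDs.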

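I tried a new approach, adding the author id to each of the child nodes, which worked perfectly. I also learned that you can select multiple nodes with XPath by putting the paths into a vector, like this: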

c("//coredata",
  "//affiliation-current",
  "//affiliation-history",
  "//subject-areas",
  "//author-profile")

So the final solution I used was:

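library(XML)

files <- list.files()

for (i in seq_along(files)) {
    author_record <- xmlParse(files[i])

    # strip the "AUTHOR_ID:" prefix from the dc:identifier value
    auth_id <- gsub("AUTHOR_ID:", "", xmlValue(author_record[["//dc:identifier"]]))

    # add auth_id as an attribute to every node matched by any of the paths
    xpathApply(
      author_record, c(
        "//coredata",
        "//affiliation-current",
        "//affiliation-history",
        "//subject-areas",
        "//author-profile"
      ),
      addAttributes,
      auth_id = auth_id
    )

    # overwrite the original file with the modified document
    saveXML(author_record, file = files[i])
}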
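The multi-node selection described above can be tried on a toy document. This is a minimal sketch, not the author's exact code: the element names and the `paths` vector are made up for illustration, and the paths are joined into an explicit XPath union (`|`) on the assumption that this is equivalent to passing the vector directly.

```r
library(XML)

# Toy document with illustrative element names (not the real Scopus structure)
doc <- xmlParse(
  '<retrieval-response>
     <coredata><id>12345</id></coredata>
     <author><first>John</first><last>Doe</last></author>
   </retrieval-response>',
  asText = TRUE)

# Hypothetical list of nodes that should receive the attribute
paths <- c("//coredata", "//author")

# Join the paths into one XPath union expression: "//coredata | //author"
union_path <- paste(paths, collapse = " | ")

# Read the id once, then reuse it for every matched node
auth_id <- xmlValue(getNodeSet(doc, "//coredata/id")[[1]])

# addAttributes() is called once per matched node and edits it in place
invisible(xpathApply(doc, union_path, addAttributes, auth_id = auth_id))

# Names of all elements that now carry the auth_id attribute
tagged <- xpathSApply(doc, "//*[@auth_id]", xmlName)
print(tagged)
```

Because the nodes are modified in place, the result of `xpathApply()` does not need to be assigned; calling `saveXML(doc, file = ...)` afterwards writes the tagged document.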

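Old answer: After much experimentation, I found a relatively simple solution to my problem.

Attributes can be added to the top-level node by simply using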

addAttributes(xmlRoot(xmlfile), attribute = "attributeValue")
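As a self-contained illustration of that call, here is a minimal sketch using the toy document from the question. The attribute name `id` and the temporary output file are assumptions made for the example.

```r
library(XML)

# Toy document from the question (illustrative only)
xml_string <- c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  '<retrieval-response status="found">',
  '<coredata><id type="author">12345</id></coredata>',
  '<author><first>John</first><last>Doe</last></author>',
  '</retrieval-response>')

doc <- xmlParse(paste(xml_string, collapse = "\n"), asText = TRUE)

# xmlRoot() returns the top-level node; addAttributes() modifies it in place
root <- xmlRoot(doc)
addAttributes(root, id = "12345")

# The existing "status" attribute is kept and "id" is added alongside it
print(xmlAttrs(root))

# Write the modified document to a temporary file (hypothetical location)
out_file <- tempfile(fileext = ".xml")
saveXML(doc, file = out_file)
```

Working on the node returned by `xmlRoot()` avoids the subsetting assignment on the document object that triggered the 'externalptr' error in the question.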

For my specific case, the simplest solution would be a simple loop:

library(XML)

setwd("C:/directory/with/individual/xmlfiles")

files <- list.files()

for (i in seq_along(files)) {

    author_record <- xmlParse(files[i])

    # add the author id (without its "AUTHOR_ID:" prefix) to the root node
    addAttributes(node = xmlRoot(author_record),
                  id = gsub(pattern = "AUTHOR_ID:",
                            replacement = "",
                            x = xmlValue(author_record[["//dc:identifier"]])))

    saveXML(author_record, file = files[i])
}

I'm sure there are better ways. Clearly I need to learn XSLT, which is a very powerful approach!

Source

2015-11-12 19:48:09

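To migrate XML data into the two-dimensional rows and columns of datasets and data frames, all nesting must be removed so that you iterate over only a parent and one level of children. XSLT is a special-purpose declarative programming language designed for restructuring XML documents to meet any nuanced need, so it can be used to reshape the XML data to suit the end user.

Given your sample XML, below is a working XSLT script, and the resulting XML imports successfully into SAS. Have the SAS code loop over all of your thousands of XML files to restructure them.

XSLT (save as .xsl or .xslt)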

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" 
     xmlns:ait="http://www.elsevier.com/xml/ani/ait" 
     xmlns:ce="http://www.elsevier.com/xml/ani/common" 
     xmlns:cto="http://www.elsevier.com/xml/cto/dtd" 
     xmlns:dc="http://purl.org/dc/elements/1.1/" 
     xmlns:ns1="http://webservices.elsevier.com/schemas/search/fast/types/v4" 
     xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/" 
     xmlns:xocs="http://www.elsevier.com/xml/xocs/dtd" 
     xmlns:xoe="http://www.elsevier.com/xml/xoe/dtd" 
     exclude-result-prefixes="ait ce cto dc ns1 prism xocs xoe"> 
<xsl:output version="1.0" encoding="UTF-8" indent="yes" /> 

<xsl:template match="author-retrieval-response"> 
    <xsl:variable select="substring-after(coredata/dc:identifier, ':')" name="authorid"/> 
    <root> 
     <coredata> 
     <authorid><xsl:value-of select="$authorid"/></authorid> 
     <xsl:for-each select="coredata/*">   
      <xsl:element name="{local-name()}">  
      <xsl:value-of select="concat(.,@href)"/> 
      </xsl:element> 
     </xsl:for-each> 
     </coredata> 

     <subjectAreas> 
     <authorid><xsl:value-of select="$authorid"/></authorid> 
     <xsl:for-each select="subject-areas/*">   
      <xsl:element name="{local-name()}">  
      <xsl:value-of select="."/> 
      </xsl:element> 
     </xsl:for-each> 
     </subjectAreas> 

     <authorname> 
     <authorid><xsl:value-of select="$authorid"/></authorid> 
     <xsl:for-each select="author-profile/preferred-name/*">   
      <xsl:element name="{local-name()}">  
      <xsl:value-of select="."/> 
      </xsl:element> 
     </xsl:for-each> 
     </authorname> 

     <classifications> 
     <authorid><xsl:value-of select="$authorid"/></authorid> 
     <xsl:for-each select="author-profile/classificationgroup/classifications/*">   
      <xsl:element name="{local-name()}">  
      <xsl:value-of select="."/> 
      </xsl:element> 
     </xsl:for-each> 
     </classifications> 

     <journals> 
     <authorid><xsl:value-of select="$authorid"/></authorid> 
     <xsl:for-each select="author-profile/journal-history/journal/*">   
      <xsl:element name="{local-name()}">  
      <xsl:value-of select="."/> 
      </xsl:element> 
     </xsl:for-each> 
     </journals> 

     <ipdoc> 
     <authorid><xsl:value-of select="$authorid"/></authorid> 
     <xsl:for-each select="author-profile/affiliation-current/affiliation/ip-doc/*[not(local-name()='address')]">   
      <xsl:element name="{local-name()}">  
      <xsl:value-of select="."/> 
      </xsl:element> 
     </xsl:for-each> 
     </ipdoc> 

     <address> 
     <authorid><xsl:value-of select="$authorid"/></authorid> 
     <xsl:for-each select="author-profile/affiliation-current/affiliation/ip-doc/address/*">   
      <xsl:element name="{local-name()}">  
      <xsl:value-of select="."/> 
      </xsl:element> 
     </xsl:for-each> 
     </address> 
    </root> 
</xsl:template> 

</xsl:transform>

SAS (using the above script)

proc xsl 
    in="C:\Path\To\Original.xml" 
    out="C:\Path\To\Output.xml" 
    xsl="C:\Path\To\XSLT.xsl"; 
run; 

** STORING XML CONTENT; 
libname temp xml 'C:\Path\To\Output.xml'; 

** APPEND CONTENT TO SAS DATASETS; 
data Work.Coredata; 
    retain authorid; 
    set temp.Coredata; ** NAME OF PARENT NODE IN XML; 
run; 

data Work.SubjectAreas; 
    retain authorid; 
    set temp.SubjectAreas; ** NAME OF PARENT NODE IN XML; 
run; 

data Work.Authorname; 
    retain authorid; 
    set temp.Authorname; ** NAME OF PARENT NODE IN XML; 
run; 

data Work.Classifications; 
    retain authorid; 
    set temp.Classifications; ** NAME OF PARENT NODE IN XML; 
run; 

data Work.Journals; 
    retain authorid; 
    set temp.Journals; ** NAME OF PARENT NODE IN XML; 
run; 

data Work.Ipdoc;  
    retain authorid; 
    set temp.Ipdoc; ** NAME OF PARENT NODE IN XML; 
run;

XML OUTPUT (imported as the Authorsdata dataset with one row and 40 variables)

<?xml version="1.0" encoding="UTF-8"?> 
<root> 
    <coredata> 
     <authorid>1234567</authorid> 
     <url>http://api.elsevier.com/content/author/author_id/1234567</url> 
     <identifier>AUTHOR_ID:1234567</identifier> 
     <eid>9-s2.0-1234567</eid> 
     <document-count>3</document-count> 
     <cited-by-count>95</cited-by-count> 
     <citation-count>97</citation-count> 
     <link>http://api.elsevier.com/content/search/scopus?query=refauid%1234567%29</link> 
     <link>http://www.scopus.com/authid/detail.url?partnerID=HzOxMe3b&amp;authorId=1234567&amp;origin=inward</link> 
     <link>http://api.elsevier.com/content/author/author_id/1234567</link> 
     <link>http://api.elsevier.com/content/search/scopus?query=au-id%281234567%29</link> 
    </coredata> 
    <subjectAreas> 
     <authorid>1234567</authorid> 
     <subject-area>Human-Computer Interaction</subject-area> 
     <subject-area>Control and Systems Engineering</subject-area> 
     <subject-area>Software</subject-area> 
     <subject-area>Computer Vision and Pattern Recognition</subject-area> 
     <subject-area>Artificial Intelligence</subject-area> 
    </subjectAreas> 
    <authorname> 
     <authorid>1234567</authorid> 
     <initials>A.</initials> 
     <indexed-name>John A.</indexed-name> 
     <surname>John</surname> 
     <given-name>Doe</given-name> 
    </authorname> 
    <classifications> 
     <authorid>1234567</authorid> 
     <classification>1709</classification> 
     <classification>2207</classification> 
     <classification>1712</classification> 
     <classification>1707</classification> 
     <classification>1702</classification> 
    </classifications> 
    <journals> 
     <authorid>1234567</authorid> 
     <sourcetitle>Very Prestigious Journal</sourcetitle> 
     <sourcetitle-abbrev>V PRES JOU Autom</sourcetitle-abbrev> 
     <issn>10504729</issn> 
     <sourcetitle>2005 Another Prestigious Journal</sourcetitle> 
     <sourcetitle-abbrev>An. Prest. Jou. </sourcetitle-abbrev> 
    </journals> 
    <ipdoc> 
     <authorid>1234567</authorid> 
     <afnameid>Prestigious University#1111111</afnameid> 
     <afdispname>Prestigious University University</afdispname> 
     <preferred-name>Prestigious University University</preferred-name> 
     <sort-name>Prestigious University</sort-name> 
     <org-domain>pu.edu</org-domain> 
     <org-URL>http://www.pu.edu/index.shtml</org-URL> 
    </ipdoc> 
    <address> 
     <authorid>1234567</authorid> 
     <address-part>1234 Prestigious Lane</address-part> 
     <city>City</city> 
     <state>ST</state> 
     <postal-code>12345</postal-code> 
     <country>United States</country> 
    </address> 
</root>

R alternative

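Since no comprehensive XSLT library exists for R, the parsing has to be done directly in the R language. However, R can call the XSLT processors of other executables (i.e. Python, Saxon, VBA) through the command line, the RDCOMClient package, and other interfaces.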
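As a rough sketch of the command-line route, the snippet below builds a call to `xsltproc`, the command-line tool shipped with libxslt. `xsltproc` is not mentioned in the answer and is only an assumption here; `run_xslt` is a hypothetical helper and the file names are placeholders. The transformation only runs if the tool is found on the PATH.

```r
# Hypothetical helper: builds an xsltproc call and runs it only when asked to
# and when the tool is installed. Usage: xsltproc -o <output> <stylesheet> <input>
run_xslt <- function(xml_in, xsl, xml_out, run = TRUE) {
  args <- c("-o", shQuote(xml_out), shQuote(xsl), shQuote(xml_in))
  if (run && nzchar(Sys.which("xsltproc"))) {
    system2("xsltproc", args)
  } else {
    message("xsltproc not run; arguments would be: ",
            paste(args, collapse = " "))
  }
  invisible(args)
}

# Dry run with placeholder file names; nothing is executed
args <- run_xslt("Original.xml", "XSLT.xsl", "Output.xml", run = FALSE)
```

The same pattern should carry over to other processors by swapping the executable name and argument order for whatever that tool expects.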

Nonetheless, R can extract the XML data with xmlToDataFrame() and retrieve the authorid with xpathSApply() (the latter takes an XPath expression):

library(XML)

# parse the original file (placeholder path, as in the SAS example above)
doc <- xmlParse("C:/Path/To/Original.xml")

# author id shared by every data frame below, without the "AUTHOR_ID:" prefix
authorid <- gsub(pattern = "AUTHOR_ID:", replacement = "",
                 xpathSApply(doc, '//coredata/dc:identifier', xmlValue)[[1]])

coredata <- xmlToDataFrame(nodes = getNodeSet(doc, '//coredata'))
coredata$authorid <- authorid

subjectareas <- xmlToDataFrame(nodes = getNodeSet(doc, "//subject-areas"))
subjectareas$authorid <- authorid

authorname <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/preferred-name'))
authorname$authorid <- authorid

classifications <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/classificationgroup/classifications'))
classifications$authorid <- authorid

journal <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/journal-history/journal'))
journal$authorid <- authorid

ipdoc <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/affiliation-current/affiliation/ip-doc'))
ipdoc$authorid <- authorid

address <- xmlToDataFrame(nodes = getNodeSet(doc, '//author-profile/affiliation-current/affiliation/ip-doc/address'))
address$authorid <- authorid
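The pattern above can be checked on a small, namespace-free document. This is a minimal sketch with made-up element names, so the XPath expressions differ from the Scopus ones.

```r
library(XML)

# Toy document with illustrative element names (not the real Scopus structure)
doc <- xmlParse(
  '<retrieval-response>
     <coredata><id>12345</id></coredata>
     <author><first>John</first><last>Doe</last></author>
   </retrieval-response>',
  asText = TRUE)

# One row per <author> node, one column per child element
author <- xmlToDataFrame(nodes = getNodeSet(doc, "//author"),
                         stringsAsFactors = FALSE)

# Attach the id taken from the sibling <coredata> node
author$authorid <- xpathSApply(doc, "//coredata/id", xmlValue)[[1]]

print(author)
```

Each data frame built this way carries the same `authorid` column, which is what allows the separate tables to be joined later.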

Source

2015-11-11 22:25:05 Parfait

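What kind of sorcery... This is an amazing answer, thank you for being so thorough. All of the information ends up in one dataset, though, whereas I need relational datasets, with each group of information kept separate but each carrying the unique identifier. I will read through this carefully and learn what I can. Unless you have other ideas, I'll mark this as the answer soon –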

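See the update. XSLT can use [variables](http://www.w3schools.com/xsl/el_variable.asp), which can be passed to other parts of the document, and can even parse out 'Author:' with the [substring-after](http://zvon.org/xxl/XSLTreference/OutputOverview/function_substring-after_frame.html) function. So 'authorid' can be passed into the other relevant nodes. In fact, I just learned here that SAS can import multiple tables from one XML file! I will certainly add this example to my library. As for R, just use 'xmlToDataFrame' for the node sets and 'xpathSApply()' for the authorids. Thanks for the question! – Parfait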
