R: Extracting specific node content from XML data

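Using R and the XML package (xmlTreeParse etc.), I have tried my best to read specific nodes from an XML file, without success. The following dummy XML example represents the data I am working with: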

<item> 
<title> Mickey Mouse </title> 
<description> Cartoon </description> 
<pubDate> 25 Apr 1965 </pubDate> 
<disney:Filing web="http://www.waltdisney.com/archives"> 
<disney:fileNumber>125364</disney:fileNumber> 
<disney:assignedID>7389</disney:assignedID> 
<disney:Files> 
    <disney:File disney:set="1" disney:file="abc.mov" disney:type="B&W"/> 
    <disney:File disney:set="2" disney:file="def.mov" disney:type="Col"/> 
    <disney:File disney:set="3" disney:file="wzt.mov" disney:type="B&W"/> 
</disney:Files> 
</disney:Filing> 
</item>

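I applied xpathApply and successfully extracted the first three nodes. But I cannot reach the nodes labelled "disney:File". For some reason, anything beyond disney:File is unreadable ("invisible").

My goal is either to extract all the disney:File rows into a data frame or, nicer still, to first search for a specific disney:set and extract all the information from that node alone into a data frame. Any help would be great. Thanks in advance!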

Source

2014-07-16 PBolbrinker

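You need to use namespaces in your XPath. See 'xmlNamespaces' for more details. Without the actual XML file and its namespace definitions, we can't really help. For example, 'xpathSApply(doc, "//*/disney:File", xmlValue)' might work, but there may be other namespaces involved. – jdharrison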
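To illustrate the namespace point on the question's own data, here is a minimal sketch. It rests on two assumptions that are not in the original post: the namespace URI "http://www.example.com/disney" is a hypothetical stand-in (the question never shows the real xmlns:disney declaration), and "B&W" is written as "B&amp;W" so the snippet is well-formed XML. The column names set, file and type are chosen here for readability.

```r
library(XML)

# Hypothetical namespace URI; replace it with the one declared in the real file.
xml_text <- '<item xmlns:disney="http://www.example.com/disney">
  <title>Mickey Mouse</title>
  <disney:Filing web="http://www.waltdisney.com/archives">
    <disney:Files>
      <disney:File disney:set="1" disney:file="abc.mov" disney:type="B&amp;W"/>
      <disney:File disney:set="2" disney:file="def.mov" disney:type="Col"/>
      <disney:File disney:set="3" disney:file="wzt.mov" disney:type="B&amp;W"/>
    </disney:Files>
  </disney:Filing>
</item>'

doc <- xmlParse(xml_text, asText = TRUE)

# Map a local prefix (d) to the namespace URI used in the document.
ns <- c(d = "http://www.example.com/disney")

# One character vector of attribute values per disney:File node.
rows <- xpathApply(doc, "//d:File", xmlAttrs, namespaces = ns)

# Stack the vectors into a data frame; attributes appear in document order.
files <- as.data.frame(do.call(rbind, lapply(rows, unname)),
                       stringsAsFactors = FALSE)
names(files) <- c("set", "file", "type")

# Pick out a single node by its disney:set attribute with an XPath predicate.
set2 <- xpathApply(doc, "//d:File[@d:set='2']", xmlAttrs, namespaces = ns)
```

The prefix used in the XPath (d here) does not have to match the prefix in the file; only the URI has to match.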

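If all you really want is the 'disney:File' data, and you are fairly sure each entry sits on a single line, 'readLines' + 'grep' + 'str_extract' may be enough. There is no need for slow, memory-hungry tree parsing just because the input is XML. Of course, for more complex extraction (if you are pulling several kinds of data out of each file), XML parsing makes a lot of sense. – hrbrmstr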
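A rough sketch of that line-based approach, assuming each disney:File element really does sit on one line. Base R's sub() is swapped in for stringr's str_extract so the snippet has no dependencies; get_attr is a hypothetical helper, and the temporary file only stands in for the real input.

```r
# Write the question's sample lines to a temporary file to stand in for the real one.
sample_lines <- c(
  '<disney:Files>',
  '    <disney:File disney:set="1" disney:file="abc.mov" disney:type="B&W"/>',
  '    <disney:File disney:set="2" disney:file="def.mov" disney:type="Col"/>',
  '    <disney:File disney:set="3" disney:file="wzt.mov" disney:type="B&W"/>',
  '</disney:Files>'
)
path <- tempfile(fileext = ".xml")
writeLines(sample_lines, path)

# Read the file as plain text and keep only the disney:File lines.
# The trailing space in the pattern excludes the <disney:Files> wrapper.
raw <- readLines(path)
file_lines <- grep("<disney:File ", raw, fixed = TRUE, value = TRUE)

# Hypothetical helper: pull the value of one disney:<name> attribute from each line.
get_attr <- function(x, name) {
  pattern <- paste0('.*disney:', name, '="([^"]*)".*')
  sub(pattern, "\\1", x)
}

files_df <- data.frame(
  set  = get_attr(file_lines, "set"),
  file = get_attr(file_lines, "file"),
  type = get_attr(file_lines, "type"),
  stringsAsFactors = FALSE
)

# Keep only the row for a specific disney:set.
wanted <- files_df[files_df$set == "2", ]
```

This avoids the namespace issue entirely because nothing is parsed as XML, but it breaks as soon as an element is split across lines or attributes contain escaped quotes.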

Thank you both, @jdharrison and hrbrmstr. I went with readLines etc., since that seemed simpler and more direct for this task. Great help! – PBolbrinker

Some sample data:

'<?xml version="1.0"?> 
<aw:PurchaseOrder 
    aw:PurchaseOrderNumber="99503" 
aw:OrderDate="1999-10-20" 
xmlns:aw="http://www.adventure-works.com"> 
<aw:Address aw:Type="Shipping"> 
<aw:Name>Ellen Adams</aw:Name> 
<aw:Street>123 Maple Street</aw:Street> 
<aw:City>Mill Valley</aw:City> 
<aw:State>CA</aw:State> 
<aw:Zip>10999</aw:Zip> 
<aw:Country>USA</aw:Country> 
</aw:Address> 
<aw:Address aw:Type="Billing"> 
<aw:Name>Tai Yee</aw:Name> 
<aw:Street>8 Oak Avenue</aw:Street> 
<aw:City>Old Town</aw:City> 
<aw:State>PA</aw:State> 
<aw:Zip>95819</aw:Zip> 
<aw:Country>USA</aw:Country> 
</aw:Address> 
<aw:DeliveryNotes>Please leave packages in shed by driveway.</aw:DeliveryNotes> 
<aw:Items> 
<aw:Item aw:PartNumber="872-AA"> 
<aw:ProductName>Lawnmower</aw:ProductName> 
<aw:Quantity>1</aw:Quantity> 
<aw:USPrice>148.95</aw:USPrice> 
<aw:Comment>Confirm this is electric</aw:Comment> 
</aw:Item> 
<aw:Item aw:PartNumber="926-AA"> 
<aw:ProductName>Baby Monitor</aw:ProductName> 
<aw:Quantity>2</aw:Quantity> 
<aw:USPrice>39.98</aw:USPrice> 
<aw:ShipDate>1999-05-21</aw:ShipDate> 
</aw:Item> 
</aw:Items> 
</aw:PurchaseOrder>' -> xData

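You can declare the namespace yourself; here we use ns as a label for it. In this case we could simply use aw:Item, but we label the namespace explicitly as an example: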

library(XML)
myData <- xmlParse(xData)
xpathSApply(myData, "//*/ns:Item/ns:ProductName",
            namespaces = c(ns = "http://www.adventure-works.com"),
            xmlValue)
# [1] "Lawnmower"    "Baby Monitor"
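Two small follow-ups to the example above, as a sketch rather than part of the original answer: looking up which namespaces the document declares, and narrowing the match with an attribute predicate. The XML here is a shortened copy of the sample data so the snippet runs on its own.

```r
library(XML)

# Shortened copy of the sample purchase order.
xShort <- '<?xml version="1.0"?>
<aw:PurchaseOrder xmlns:aw="http://www.adventure-works.com">
  <aw:Items>
    <aw:Item aw:PartNumber="872-AA">
      <aw:ProductName>Lawnmower</aw:ProductName>
    </aw:Item>
    <aw:Item aw:PartNumber="926-AA">
      <aw:ProductName>Baby Monitor</aw:ProductName>
    </aw:Item>
  </aw:Items>
</aw:PurchaseOrder>'

shortDoc <- xmlParse(xShort, asText = TRUE)

# Namespace definitions found on the root node, as a named character vector.
ns_defs <- xmlNamespaceDefinitions(shortDoc, simplify = TRUE)

# Attributes are namespaced too, so the predicate needs the prefix as well.
ns <- c(ns = "http://www.adventure-works.com")
one_item <- xpathSApply(shortDoc,
                        "//ns:Item[@ns:PartNumber='926-AA']/ns:ProductName",
                        xmlValue,
                        namespaces = ns)
```

Checking ns_defs first is a quick way to find the URI to pass in namespaces when a query unexpectedly returns nothing.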

Source

2014-07-16 16:21:20 jdharrison
