Parsing a directory of XML files with R

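I am parsing a directory of XML files downloaded from ClinicalTrials.gov and cannot extract the data. I can do this for a single file (NCT00006435.xml below), but I cannot figure out how to do it for multiple files.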

library(XML) 
# Download ct.gov query and extract xml files 
ct<-tempfile() 
dir.create("ctdir") 
url<-"https://clinicaltrials.gov/search?term=neurofibromatosis-type-1&studyxml=true" 
download.file(url, ct) 
unzip(ct, exdir="ctdir") 
files<-list.files("ctdir") 
# Change the working directory so we don't have to worry about the filepath 
setwd("ctdir") 

# Extract data from one file and get it into a data frame 
#xmlfile<-xmlTreeParse("NCT00006435.xml") 
#xmltop<-xmlRoot(xmlfile) 
#tags<-xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue)) 
#tags_df<-data.frame(t(tags),row.names=NULL) 

# Extract data from each file and get it into a data frame 
xmlfiles<-lapply(files,function(x) xmlTreeParse(x)) 
xmltop<-lapply(xmlfiles,function(x) xmlRoot(x)) 
tags<-???

How do I run through the list of files, looping over every tag in each file?

Source

2016-03-03 user1357079

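You need to actually download the individual files. 'xmlTreeParse()' runs on _local_ files to extract the XML. At the moment, I believe 'files' just contains a list of matching file names as they appear on the server.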

Also, 'xmlTreeParse()' does not automatically convert to a data frame; you need 'xmlToDataFrame()' for that. Posting a sample XML would help. – Parfait

Arrgh. 'object.size(xmltop) # 40 196 696 bytes'. Can we have a "minimal" example? And what is your understanding of what 'tags' means?

The top of str(xmltop) looks like:

List of 107 
$ :List of 40 
    ..$ comment    : Named list() 
    .. ..- attr(*, "class")= chr [1:5] "XMLCommentNode" "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" ... 
    ..$ required_header  :List of 3 
    .. ..$ download_date:List of 1 
    .. .. ..$ text: Named list() 
    .. .. .. ..- attr(*, "class")= chr [1:5] "XMLTextNode" "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" ... 
    .. .. ..- attr(*, "class")= chr [1:4] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass" 
    .. ..$ link_text :List of 1 
    .. .. ..$ text: Named list() 
    .. .. .. ..- attr(*, "class")= chr [1:5] "XMLTextNode" "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" ... 
    .. .. ..- attr(*, "class")= chr [1:4] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass" 
    .. ..$ url   :List of 1 
    .. .. ..$ text: Named list() 
    .. .. .. ..- attr(*, "class")= chr [1:5] "XMLTextNode" "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" ... 
    .. .. ..- attr(*, "class")= chr [1:4] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass" 
    .. ..- attr(*, "class")= chr [1:4] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass" 
    ..$ id_info    :List of 4 
    .. ..$ org_study_id:List of 1 
    .. .. ..$ text: Named list() 
    .. .. .. ..- attr(*, "class")= chr [1:5] "XMLTextNode" "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" ... 
    .. .. ..- attr(*, "class")= chr [1:4] "XMLNode" "RXMLAbstractNode" "XMLAbstractNode" "oldClass"

So it is a list, and you can "loop" over its top level with a simple lapply. If you want to reuse your single-node code, it is just:

tags<-lapply(xmltop, function(x) xmlSApply(x, xmlValue)) 
object.size(tags) 
1618008 bytes

That is still a rather unwieldy object. I repeat my suggestion that you find a more manageable example.
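As an illustration of what a more manageable example could look like, here is a minimal sketch of the same lapply pattern. It assumes two tiny made-up documents parsed from strings with 'asText = TRUE' rather than the downloaded files; the element names 'study', 'id' and 'phase' are invented for the demonstration.

```r
library(XML)

# Two small, made-up documents standing in for the downloaded files.
texts <- c(
  "<study><id>A-001</id><phase>1</phase></study>",
  "<study><id>B-002</id><phase>2</phase></study>"
)

# Parse each document, take its root node, then collect the text of
# every child element of that root.
xmlfiles <- lapply(texts, function(x) xmlTreeParse(x, asText = TRUE))
xmltop   <- lapply(xmlfiles, xmlRoot)
tags     <- lapply(xmltop, function(x) xmlSApply(x, xmlValue))

print(tags[[1]])
```

Each element of 'tags' is then a named character vector with one entry per child element, which is much easier to inspect than the full 40 MB list.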

Source

2016-03-03 02:13:44

Just wrap your code in a function.

tags_df <- function(file){ 
    message("Loading ", file) 
    #your code 
    xmlfile<-xmlTreeParse(file) 
    xmltop<-xmlRoot(xmlfile) 
    tags_l<-xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue)) 
    tags<-data.frame(t(tags_l),row.names=NULL) 
    tags 
} 

tags<- lapply(files, tags_df)

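Because you have one-to-many tags such as location and keyword, combining the data.frames as below returns a mess with 260 columns, including location.1 through location.120. I would replace your code with a few specific XPath queries that put the tags you really want into an understandable format.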

library(plyr)  # provides ldply()
x <- ldply(tags, "data.frame") 
names(x)
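The XPath approach could be sketched roughly as follows. This is only an illustration under stated assumptions: the two documents are made up, the helper 'study_row' is hypothetical, and the only tag names used ('id_info', 'org_study_id', 'required_header', 'url', 'keyword') are ones that appear in this thread. Repeated tags such as 'keyword' are collapsed into a single string so that each file yields exactly one row.

```r
library(XML)

# Made-up documents written to a temporary directory, standing in for "ctdir".
docs <- c(
  "<clinical_study><required_header><url>http://example.org/a</url></required_header><id_info><org_study_id>A-001</org_study_id></id_info><keyword>k1</keyword><keyword>k2</keyword></clinical_study>",
  "<clinical_study><required_header><url>http://example.org/b</url></required_header><id_info><org_study_id>B-002</org_study_id></id_info></clinical_study>"
)
tmpdir <- tempfile("ctdemo")
dir.create(tmpdir)
for (i in seq_along(docs)) {
  writeLines(docs[i], file.path(tmpdir, paste0("study", i, ".xml")))
}

# Hypothetical helper: returns a one-row data frame for one file.
study_row <- function(path) {
  doc <- xmlParse(path)
  # First match of an XPath expression, or NA if the tag is absent.
  first <- function(xp) {
    v <- unlist(xpathSApply(doc, xp, xmlValue))
    if (length(v) == 0) NA_character_ else v[[1]]
  }
  # All matches of an XPath expression, collapsed into one string.
  collapsed <- function(xp) {
    paste(unlist(xpathSApply(doc, xp, xmlValue)), collapse = "; ")
  }
  data.frame(
    org_study_id = first("//id_info/org_study_id"),
    url          = first("//required_header/url"),
    keyword      = collapsed("//keyword"),
    stringsAsFactors = FALSE
  )
}

# full.names = TRUE avoids having to change the working directory.
paths   <- list.files(tmpdir, pattern = "\\.xml$", full.names = TRUE)
studies <- do.call(rbind, lapply(paths, study_row))
print(studies)
```

Because every call returns the same fixed set of columns, the rows bind cleanly regardless of how many locations or keywords a given study has.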

Source

2016-03-03 18:44:29
