2014-03-25 69 views
5

How do I convert part of an XML file into a data frame (properly)?

I want to extract information from a ClinicalTrials.gov XML file. The file is organized as follows:

<clinical_study> 
    ... 
    <brief_title> 
    ... 
    <location> 
    <facility> 
     <name> 
     <address> 
     <city> 
     <state> 
     <zip> 
     <country> 
    </facility> 
    <status> 
    <contact> 
     <last_name> 
     <phone> 
     <email> 
    </contact> 
    </location> 
    <location> 
    ... 
    </location> 
    ... 
</clinical_study> 

Using the R XML package from CRAN, I can extract all of the location nodes from the XML file with the code below:

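library(XML)
# Parse the study record directly from the URL into an internal document
clinicalTrialUrl <- "http://clinicaltrials.gov/ct2/show/NCT01480479?resultsxml=true"
xmlDoc <- xmlParse(clinicalTrialUrl, useInternalNodes=TRUE)
# Convert every <location> node into one row of a data frame
locations <- xmlToDataFrame(getNodeSet(xmlDoc,"//location"))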

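This sort of works. However, if you look at the data frame, you will notice that the xmlToDataFrame function collapses everything under <facility> into a single concatenated string. One solution is to write code that builds the data frame column by column; for example, you could generate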

+1

You could do something like `xpathSApply(xmlDoc, "//clinical_study/location/facility/name", xmlValue)` to pull out each component separately. I'm not sure how to do it all in one fell swoop. – thelatemail

+0

Thanks, thelatemail –

+1

What you did worked perfectly for me. My XML file is simple. – Chernoff
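
The column-by-column approach mentioned in the comments can be sketched as follows. This is a minimal illustration, not code from the thread: the inline XML, the helper names `first_value` and `extract_column`, and the choice of columns are all assumptions. Querying each `<location>` node separately keeps the columns aligned and yields NA where a field is missing, whereas a single document-wide `xpathSApply` call returns only the nodes that exist and so can come back shorter than the number of locations.

```r
library(XML)

# Hypothetical two-location document; the second location has no <contact>
xml_text <- '<clinical_study>
  <location>
    <facility><name>Site A</name><address><city>Birmingham</city></address></facility>
    <status>Recruiting</status>
    <contact><phone>205-934-1813</phone></contact>
  </location>
  <location>
    <facility><name>Site B</name><address><city>Berkeley</city></address></facility>
    <status>Withdrawn</status>
  </location>
</clinical_study>'

doc  <- xmlParse(xml_text, asText = TRUE)
locs <- getNodeSet(doc, "//location")

# First match of a relative XPath under one node, or NA if nothing matches
first_value <- function(node, path) {
  hits <- xpathSApply(node, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# One value per location for the given relative XPath
extract_column <- function(path) {
  vapply(locs, first_value, character(1), path = path)
}

locations <- data.frame(
  name   = extract_column("./facility/name"),
  city   = extract_column("./facility/address/city"),
  status = extract_column("./status"),
  phone  = extract_column("./contact/phone"),
  stringsAsFactors = FALSE
)
print(locations)
```

The cost of this design is one XPath query per location per column, which is fine for a few hundred locations but may be slow on very large documents.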

Answers

7

You can flatten the XML first.

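# Recursively collapse a node into a flat list of leaf values,
# each named after its enclosing element
flatten_xml <- function(x) {
    if (length(xmlChildren(x)) == 0) structure(list(xmlValue(x)), .Names = xmlName(xmlParent(x)))
    else Reduce(append, lapply(xmlChildren(x), flatten_xml))
}

# One single-row data frame per <location>
dfs <- lapply(getNodeSet(xmlDoc,"//location"), function(x) data.frame(flatten_xml(x)))
# Union of all column names seen across locations
allnames <- unique(c(lapply(dfs, colnames), recursive = TRUE))
# Add missing columns as NA so that the rows can be stacked with rbind
df <- do.call(rbind, lapply(dfs, function(df) { df[, setdiff(allnames,colnames(df))] <- NA; df }))
head(df)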

#          city      state   zip       country     status          last_name        phone             email               last_name.1
# 1  Birmingham    Alabama 35294 United States Recruiting Louis B Nabors, MD 205-934-1813 [email protected]        Louis B Nabors, MD
# 2      Mobile    Alabama 36604 United States Recruiting Melanie Alford, RN 251-445-9649 [email protected]    Pamela Francisco, CCRP
# 3     Phoenix    Arizona 85013 United States Recruiting     Lynn Ashby, MD 602-406-6262 [email protected]            Lynn Ashby, MD
# 4      Tucson    Arizona 85724 United States Recruiting         Jamie Holt 520-626-6800 [email protected] Baldassarre Stea, MD, PhD
# 5 Little Rock   Arkansas 72205 United States Recruiting   Wilma Brooks, RN 501-686-8530 [email protected]       Amanda Eubanks, APN
# 6    Berkeley California 94704 United States  Withdrawn               <NA>         <NA>              <NA>                      <NA>
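
The NA-padding step in this answer can be seen in isolation with base R alone. The sketch below is illustrative only: the two small data frames are made-up stand-ins for the per-location results, and the explicit `for` loop is a substitution for the answer's one-line column assignment, chosen so that the case with no missing columns is obviously a no-op.

```r
# Hypothetical per-location data frames with different sets of columns
dfs <- list(
  data.frame(city = "Birmingham", status = "Recruiting",
             phone = "205-934-1813", stringsAsFactors = FALSE),
  data.frame(city = "Berkeley", status = "Withdrawn",
             stringsAsFactors = FALSE)
)

# Union of all column names, in order of first appearance
allnames <- unique(unlist(lapply(dfs, colnames)))

# Add each missing column as NA, then put the columns in a common order
padded <- lapply(dfs, function(d) {
  for (nm in setdiff(allnames, colnames(d))) d[[nm]] <- NA
  d[allnames]
})

# Every data frame now has the same columns, so rbind can stack them
df <- do.call(rbind, padded)
print(df)
```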
+0

Thanks, it works. For some reason my compiler didn't like the function's syntax, so I had to change it to: `flatten_xml <- function(x) { if (length(xmlChildren(x)) == 0) { structure(list(xmlValue(x)), .Names = xmlName(xmlParent(x))) } else { Reduce(append, lapply(xmlChildren(x), flatten_xml)) } }` –

+0

Yes, I think we are using different versions. Fixed. –

+0

When you get a chance, don't forget to accept my answer. :) –

3

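This answer converts the XML to a list, unlists each location section, transposes it, converts it to a data.table, and then uses rbindlist to combine all of the individual locations into one table. The fill=T argument matches elements by name and fills missing element values with NA.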

library(XML); library(data.table)

clinicalTrialUrl <- "http://clinicaltrials.gov/ct2/show/NCT01480479?resultsxml=true"
xmlDoc <- xmlParse(clinicalTrialUrl, useInternalNodes=TRUE)

# Build one data.table from all nodes matching an XPath expression
xmlToDT <- function(doc, path) {
    rbindlist(
    lapply(getNodeSet(doc, path),
      # nested list -> named vector -> 1-row matrix -> 1-row data.table
      function(x) data.table(t(unlist(xmlToList(x))))
    ), fill=T)
}

locationDT <- xmlToDT(xmlDoc, "//location") 
locationDT[1:6] 
##                  facility.name facility.address.city facility.address.state facility.address.zip 
## 1:                "HYGEIA" Hospital    Marousi  District of Attica    151 23 
## 2: Allina Health, Abbott Northwestern Hospital, John Nasseff Neuroscience Institute   Minneapolis    Minnesota    55407 
## 3:     Amrita Institute of Medical Sciences and Research Centre, Kochi     Kochi     Kerala    682 026 
## 4:              Anne Arundel Medical Center    Annapolis    Maryland    21401 
## 5:                Atlanta Cancer Care    Atlanta    Georgia    30005 
## 6:                 Austin Health   Heidelberg    Victoria     3084 
## facility.address.country 
## 1:     Greece 
## 2:   United States 
## 3:     India 
## 4:   United States 
## 5:   United States 
## 6:    Australia 
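
The dotted column names in the output above (for example `facility.address.city`) come from how `unlist` names the elements of a nested list. The base R sketch below shows that step on its own. The nested list is a hand-written, hypothetical stand-in for what `xmlToList` might return for one location, and `as.data.frame` is used in place of `data.table` so the example needs no extra packages.

```r
# Hypothetical nested list standing in for one <location> after xmlToList
loc <- list(
  facility = list(
    name    = "Anne Arundel Medical Center",
    address = list(city = "Annapolis", state = "Maryland")
  ),
  status = "Recruiting"
)

# unlist() flattens the nesting and joins the names along each path with "."
flat <- unlist(loc)
print(names(flat))

# t() turns the named vector into a 1-row matrix whose column names are
# the flattened names; that matrix converts cleanly to a 1-row table
row <- as.data.frame(t(flat), stringsAsFactors = FALSE)
print(row)
```

Because each location becomes a one-row table with its own set of named columns, a name-matching row bind can line them up even when some locations lack a field.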