Creating a dataset from an XML file in R

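I want to download XML files of journal article records and build a dataset in R for further interrogation. I'm completely new to XML and fairly new to R. I cobbled together the code below using bits from two sources: GoogleScholarXScraper and Extracting records from pubMed.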

library(RCurl) 
library(XML) 
library(stringr) 

#Search terms 
SearchString<-"cancer+small+cell+non+lung+survival+plastic" 
mySearch<-str_c("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=",SearchString,"&usehistory=y",sep="",collapse=NULL) 

#Search 
pub.esearch<-getURL(mySearch) 

#Extract QueryKey and WebEnv 
pub.esearch<-xmlTreeParse(pub.esearch,asText=TRUE) 
key<-as.numeric(xmlValue(pub.esearch[["doc"]][["eSearchResult"]][["QueryKey"]])) 
env<-xmlValue(pub.esearch[["doc"]][["eSearchResult"]][["WebEnv"]]) 

#Fetch Records 
myFetch<-str_c("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&WebEnv=",env,"&retmode=xml&query_key=",key) 
pub.efetch<-getURL(myFetch) 
myxml<-xmlTreeParse(pub.efetch,asText=TRUE,useInternalNodes=TRUE) 

#Create dataset of article characteristics #This doesn't work 
pub.data<-NULL 
pub.data<-data.frame(
    journal <- xpathSApply(myxml,"//PubmedArticle/MedlineCitation/MedlineJournalInfo/MedlineTA", xmlValue), 
    abstract<- xpathSApply(myxml,"//PubmedArticle/MedlineCitation/Article/Abstract/AbstractText",xmlValue), 
    affiliation<-xpathSApply(myxml,"//PubmedArticle/MedlineCitation/Article/Affiliation", xmlValue), 
    year<-xpathSApply(myxml,"//PubmedArticle/MedlineCitation/Article/Journal/JournalIssue/PubDate/Year", xmlValue) 
    ,stringsAsFactors=FALSE)

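The main problem seems to be that the XML I get back does not have a completely uniform structure. For example, some citations have a node structure like this: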

- <Abstract> 
<AbstractText>The Wilms' tumor gene... </AbstractText>

while others have labels and look like this:

- <Abstract> 
<AbstractText Label="BACKGROUND &#38; AIMS" NlmCategory="OBJECTIVE">Some background text.</AbstractText> 
<AbstractText Label="METHODS" NlmCategory="METHODS"> Some text on methods.</AbstractText>

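When I extract 'AbstractText' I expect to get 24 rows of data back (there were 24 records when I ran the search), but xpathSApply returns every labelled 'AbstractText' node as a separate element of my data frame. Is there a way to collapse the XML structure / ignore the labels in this situation? Is there a way to make xpathSApply return 'NA' when it finds nothing at the end of a path? I'm aware of xmlToDataFrame, which sounds like it should fit the bill, but whenever I try it, it doesn't give me anything sensible.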

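Thanks for your help.

Source

2012-07-25 DavidT85

I'm not sure which of these you want, though.

xpathSApply(myxml,"//*/AbstractText[@Label]")

will get the nodes that have a Label attribute (keeping all attributes etc.).

xpathSApply(myxml,"//*/AbstractText[not(@Label)]",xmlValue)

will get the nodes without a Label attribute.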
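To see what the two predicates select, here is a minimal sketch on a small inline document. The XML in `txt` is made up for illustration (it only mimics the two abstract layouts from the question), and the variable names are assumptions, not part of the original code.

```r
library(XML)

# Hypothetical two-article document mimicking the two layouts in the question
txt <- '<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><Article>
    <Abstract><AbstractText>Plain abstract.</AbstractText></Abstract>
  </Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><Article>
    <Abstract>
      <AbstractText Label="BACKGROUND">Some background.</AbstractText>
      <AbstractText Label="METHODS">Some methods.</AbstractText>
    </Abstract>
  </Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>'
doc <- xmlParse(txt, asText = TRUE)

# Text of the AbstractText nodes that carry a Label attribute
labelled <- xpathSApply(doc, "//AbstractText[@Label]", xmlValue)

# Text of the AbstractText nodes that have no Label attribute
unlabelled <- xpathSApply(doc, "//AbstractText[not(@Label)]", xmlValue)

# The Label values themselves, in document order
labels <- xpathSApply(doc, "//AbstractText[@Label]", xmlGetAttr, "Label")
```

Neither query on its own gives one value per article: the labelled abstract contributes two elements and the plain one contributes a single element.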

Edit:

test<-xpathApply(myxml,"//*/Abstract",xmlValue) 

> length(test) 
[1] 24

may give you what you want.
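Querying at the Abstract level works because xmlValue returns the text of everything underneath the node. If you would rather control how the sections are joined, one possible variant is to collect the AbstractText children of each Abstract yourself and paste them together. This is only a sketch: `collapse_abstract` is a hypothetical helper and the XML in `txt` is invented for the example.

```r
library(XML)

# Hypothetical document: one plain abstract, one abstract split into labelled sections
txt <- '<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><Article>
    <Abstract><AbstractText>Plain abstract.</AbstractText></Abstract>
  </Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><Article>
    <Abstract>
      <AbstractText Label="BACKGROUND">Some background.</AbstractText>
      <AbstractText Label="METHODS"> Some methods.</AbstractText>
    </Abstract>
  </Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>'
doc <- xmlParse(txt, asText = TRUE)

# Hypothetical helper: join all AbstractText children of one Abstract node
collapse_abstract <- function(node) {
  parts <- xpathSApply(node, "./AbstractText", xmlValue)
  paste(trimws(parts), collapse = " ")
}

# One string per Abstract node, regardless of how many sections it has
abstracts <- xpathSApply(doc, "//Abstract", collapse_abstract)
```

Bear in mind that `//Abstract` only matches articles that actually have an abstract, so the result can still be shorter than the number of records if some have none.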

Edit:

To get affiliation, year, etc. padded with NAs:

# Run the XPath query xstr relative to node x; return NA if nothing matches
dumfun<-function(x,xstr){ 
  res<-xpathSApply(x,xstr,xmlValue) 
  if(length(res)==0){ 
    out<-NA 
  }else{ 
    out<-res 
  } 
  out 
} 

xpathSApply(myxml,"//*/Article",dumfun,xstr='./Affiliation') 
xpathSApply(myxml,"//*/Article",dumfun,xstr='./Journal/JournalIssue/PubDate/Year')
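Putting the pieces together, one way to end up with exactly one row per record is to loop over the PubmedArticle nodes and force every field to a single value: NA when the path matches nothing, and a pasted string when it matches several nodes. The sketch below assumes the element paths from the question; `one_value` and `field` are hypothetical helpers, and the inline XML is made up (the second record deliberately lacks Affiliation and Year).

```r
library(XML)

# Hypothetical records; the second one has no Affiliation and no Year element
txt <- '<PubmedArticleSet>
  <PubmedArticle><MedlineCitation>
    <Article>
      <Journal><JournalIssue><PubDate><Year>2011</Year></PubDate></JournalIssue></Journal>
      <Abstract><AbstractText>Plain abstract.</AbstractText></Abstract>
      <Affiliation>Some University.</Affiliation>
    </Article>
    <MedlineJournalInfo><MedlineTA>J One</MedlineTA></MedlineJournalInfo>
  </MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation>
    <Article>
      <Journal><JournalIssue><PubDate><MedlineDate>2012 Jan-Feb</MedlineDate></PubDate></JournalIssue></Journal>
      <Abstract>
        <AbstractText Label="BACKGROUND">Some background.</AbstractText>
        <AbstractText Label="METHODS">Some methods.</AbstractText>
      </Abstract>
    </Article>
    <MedlineJournalInfo><MedlineTA>J Two</MedlineTA></MedlineJournalInfo>
  </MedlineCitation></PubmedArticle>
</PubmedArticleSet>'
doc <- xmlParse(txt, asText = TRUE)

# Hypothetical helper: exactly one string per node, NA when the path matches nothing
one_value <- function(node, xstr) {
  res <- xpathSApply(node, xstr, xmlValue)
  if (length(res) == 0) NA_character_ else paste(trimws(res), collapse = " ")
}

# One node per record, so every column gets the same length
articles <- getNodeSet(doc, "//PubmedArticle")
field <- function(xstr) vapply(articles, one_value, character(1), xstr = xstr)

pub.data <- data.frame(
  journal     = field("./MedlineCitation/MedlineJournalInfo/MedlineTA"),
  abstract    = field("./MedlineCitation/Article/Abstract/AbstractText"),
  affiliation = field("./MedlineCitation/Article/Affiliation"),
  year        = field("./MedlineCitation/Article/Journal/JournalIssue/PubDate/Year"),
  stringsAsFactors = FALSE
)
```

Note that the columns are named with `=` inside data.frame(); using `<-` there assigns variables in the calling environment instead of naming the columns. Because every column is built from the same list of article nodes, the lengths always agree and missing fields show up as NA.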

Source

2012-07-25 17:58:51 shhhhimhuntingrabbits

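Thanks, that works great. I still can't form the dataset, though, because some values are not available for some entries; for example, 2 records don't report an affiliation. Any ideas on how to get 'NA' returned? – DavidT85 2012-07-26 11:32:00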
