2015-11-06 22 views

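Reading a 'messy' XML file into R

I want to read an XML file into R as a data frame that I can work with easily. I have looked at examples online, but I can't find one that resembles my XML file. It comes from the US Treasury website.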

Here is the URL of the specific XML file:

http://data.treasury.gov/feed.svc/DailyTreasuryYieldCurveRateData?$filter=year(NEW_DATE)%20eq%202005

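I would like to get a data frame that looks like the table on the website (one row per date, with columns 1 mo, 3 mo, 6 mo, ...). The code below runs but does not give me the result I want. I suspect this is because the XML file is more complex than the examples I have been looking at.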

library(XML)

Ad <- 'http://data.treasury.gov/feed.svc/DailyTreasuryYieldCurveRateData?$filter=year(NEW_DATE)%20eq%202005'
xmlfile <- xmlTreeParse(Ad)
xmltop <- xmlRoot(xmlfile)
XMLcat <- xmlSApply(xmltop, function(x) xmlSApply(x, xmlValue))
plantcat_df <- data.frame(t(XMLcat), row.names=NULL)

This does not work either:

Data <- xmlToDataFrame(Ad)

Answer


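You really should read up on XML namespaces and how they work in R, and on XPath in general. Also, xml2 is a newer XML package with some nice features that you should take a look at.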
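To make the namespace point concrete, here is a minimal sketch on a tiny inline document. The document and the `http://example.com/metadata` namespace URI are made up for illustration and are not the Treasury feed. It shows that an unprefixed XPath finds nothing when the elements live in a default namespace, and that xml2 exposes that default namespace under the prefix `d1`.

```r
library(xml2)

# A tiny, made-up Atom-style document (an assumption for illustration only)
# with a default namespace and a prefixed "m" namespace.
txt <- '<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:m="http://example.com/metadata">
  <entry><content><m:properties><m:Id>1</m:Id></m:properties></content></entry>
  <entry><content><m:properties><m:Id>2</m:Id></m:properties></content></entry>
</feed>'
doc <- read_xml(txt)

# An unprefixed name only matches elements in *no* namespace, so this finds
# nothing: <entry> belongs to the default namespace.
no_prefix <- xml_find_all(doc, "//entry")
length(no_prefix)

# xml2 names the default namespace "d1"; use that prefix in the XPath.
ns <- xml_ns(doc)
with_prefix <- xml_find_all(doc, "//d1:entry", ns)
length(with_prefix)

# Elements that already carry a prefix are addressed with their own prefix.
ids <- xml_text(xml_find_all(doc, "//d1:entry/d1:content/m:properties/m:Id", ns))
ids
```

Renaming `d1` to something more readable, as the code below does with `xml_ns_rename()`, is optional; it only changes the prefix you type in the XPath expression.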

library(xml2) 

# read the doc 
doc <- read_xml("http://data.treasury.gov/feed.svc/DailyTreasuryYieldCurveRateData?$filter=year(NEW_DATE)%20eq%202005") 

# libxml2 + R == "meh" handling of default namespaces 
ns <- xml_ns_rename(xml_ns(doc), d1="default") 

# all the info is in the properties tag so focus on it 
props <- xml_find_all(doc, "//default:entry/default:content/m:properties", ns) 

# lots of ways to extract, but this data is "regular" enough to take a 
# rather simplistic approach. Extract all the node values which will be 
# separated by newlines. Convert newlines to tabs, trim the whole thing 
# and read it in as a table. 
dat <- read.table(text=trimws(gsub("\n", "\t", unlist(lapply(props, xml_text)))), 
        sep="\t", stringsAsFactors=FALSE) 

# column names would be good, so build those from one property node
colnames(dat) <- xml_name(xml_children(props[[1]])) 

# boom: done 
str(dat) 
## 'data.frame': 250 obs. of 14 variables: 
## $ Id    : int 3040 3041 3042 3043 3044 3045 3046 3047 3048 3049 ... 
## $ NEW_DATE  : chr "  2005-11-14T00:00:00" "  2005-11-10T00:00:00" "  2005-11-15T00:00:00" "  2005-11-17T00:00:00" ... 
## $ BC_1MONTH  : num 3.93 3.89 4.01 3.98 4 ... 
## $ BC_3MONTH  : num 4.02 3.97 4.01 4.01 4 ... 
## $ BC_6MONTH  : num 4.35 4.3 4.34 4.3 4.3 ... 
## $ BC_1YEAR  : num 4.4 4.34 4.38 4.32 4.34 ... 
## $ BC_2YEAR  : num 4.5 4.44 4.47 4.37 4.42 ... 
## $ BC_3YEAR  : num 4.52 4.48 4.5 4.39 4.43 ... 
## $ BC_5YEAR  : num 4.54 4.49 4.51 4.39 4.43 ... 
## $ BC_7YEAR  : num 4.57 4.51 4.52 4.42 4.45 ... 
## $ BC_10YEAR  : num 4.61 4.55 4.56 4.46 4.49 ... 
## $ BC_20YEAR  : num 4.9 4.85 4.83 4.75 4.77 ... 
## $ BC_30YEAR  : logi NA NA NA NA NA NA ... 
## $ BC_30YEARDISPLAY: int 0 0 0 0 0 0 0 0 0 0 ...
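The `str()` output shows that `NEW_DATE` is still a character column with leading whitespace and that the rows are not in date order. One possible follow-up, sketched here on a three-row stand-in for `dat` (values copied from the output above) so that it runs without network access, is to trim the strings, parse them as dates, and sort:

```r
# Stand-in for the first three rows of `dat`; only three columns are included
# to keep the sketch short.
dat <- data.frame(
  Id = c(3040L, 3041L, 3042L),
  NEW_DATE = c("  2005-11-14T00:00:00", "  2005-11-10T00:00:00",
               "  2005-11-15T00:00:00"),
  BC_1MONTH = c(3.93, 3.89, 4.01),
  stringsAsFactors = FALSE
)

# Strip the leading whitespace, then parse the timestamp into a Date.
dat$NEW_DATE <- as.Date(trimws(dat$NEW_DATE), format = "%Y-%m-%dT%H:%M:%S")

# Order the rows chronologically, like the table on the website.
dat <- dat[order(dat$NEW_DATE), ]
dat
```

The same two statements should apply unchanged to the full data frame, assuming every `NEW_DATE` value follows the timestamp pattern shown in the output.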