2017-01-26 40 views
1

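TL;DR: there is a simple question at the bottom. Moving values to different columns based on the value in a second column, XML to data.frame

I am trying to turn an XML file into a usable table in R.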

<toes copyright='(C)version='1.1'>
    <generated date='2017-01-21 07:45:04' timestamp='1485006304'/>
    <description> Active TOE vehicle levels and adjustments for the current 
    campaign up to the RDP cycle in progress. c0 = the cycle 0 capacity, adj 
    = comma-separated list of cycle:capacity adjustments, cur = current 
    capacity </description> 
    <defaults><def att='adj' value=''/></defaults> 
     <r toe="deairfor" veh="22" c0="30" cur="30"/> 
     <r toe="deairfor" veh="23" c0="40" cur="20" adj="1:35,2:20"/> 
     <r toe="deairfor" veh="26" c0="2" cur="2" adj="2:10,3:30"/> 
</toes> 

The format I expect is this:

"TOE" "Veh" "c0" "cur" "adj1" "adj2" "adj3" 
"deairfor" 22 30 30 NA NA NA 
"deairfor" 23 40 20 35 20 NA 
"deairfor" 26 2 2 NA 10 30 

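I have zero experience importing XML files, but I suspect this file is not formatted in the usual way, because I have not come across any XML examples that keep the data inside the tag itself, such as <r toe="... data ..."/>. I have been able to extract the data with the following: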

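library(XML)
library(gtools)  # provides smartbind()
source <- "http://wiretap.wwiionline.com/xml/toes.sheet.xml"
xmlfile <- xmlTreeParse(source, useInternalNodes = TRUE)
nodes <- getNodeSet(xmlfile, "/toes//r")
Df1 <- NULL
Df2 <- NULL  # must exist before the first smartbind() call

for (i in seq_along(nodes)) {
  # each <r> node becomes a one-row matrix of its attributes
  Df1 <- t(xmlToList(nodes[[i]]))
  Df2 <- smartbind(Df2, Df1[1, ])
}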

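I can only extract one row at a time, so I used the loop above to bind the rows together afterwards. I need both Df1 and Df2, otherwise it errors at i = 1. A different approach would probably be much easier, but I could not get one to work.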

This leaves me with a data frame Df2 in which all the variables are factors (why?):

"TOE" "Veh" "c0" "cur" "adj"
deairfor 22 30 30 NA
deairfor 23 40 20 1:35,2:20
deairfor 26 2 2 2:10,3:30
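
On the "why factors?" aside: a likely cause, assuming an R version before 4.0 (where `stringsAsFactors` defaulted to `TRUE`), is that the character attributes are converted to factors when the data frame is built. The sketch below uses made-up values rather than the real feed. It shows the default being overridden and the usual pitfall when turning a factor back into numbers.

```r
# Illustrative values only; not read from the XML feed.
raw <- c("30", "40", "2")

# Asking for factors explicitly reproduces the pre-4.0 default behaviour.
as_factor <- data.frame(c0 = raw, stringsAsFactors = TRUE)
# Keeping strings as character avoids the conversion.
as_char <- data.frame(c0 = raw, stringsAsFactors = FALSE)

class(as_factor$c0)  # "factor"
class(as_char$c0)    # "character"

# Pitfall: as.numeric() on a factor returns the internal level codes.
codes <- as.numeric(as_factor$c0)
# Going through as.character() first recovers the actual numbers.
values <- as.numeric(as.character(as_factor$c0))
```

If the columns are already factors, the `as.numeric(as.character(...))` idiom is the safe way to get numbers out of them.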

So now the difficulty lies in this "adj" column. I can split it up with the following:

Df2 <- separate(data = Df2, col = adj, into = c("adj1", "adj2", "adj3"), sep = "\\,")
Df2 <- separate(data = Df2, col = adj1, into = c("adj1","adj1value"), sep = "\\:") 
Df2 <- separate(data = Df2, col = adj2, into = c("adj2","adj2value"), sep = "\\:") 
Df2 <- separate(data = Df2, col = adj3, into = c("adj3","adj3value"), sep = "\\:") 

But the cells do not end up in the right columns. Df2 now looks like this:

"TOE" "Veh" "c0" "cur" "adj1" "adj1value" "adj2" "adj2value" "adj3" "adj3value" 
deairfor 22 30 30 NA NA NA NA NA NA 
deairfor 23 40 20 1 35 2 20 NA NA 
deairfor 26 2 2 2 10 3 30 NA NA 

Whereas the last row should be (once the adj values are in the proper columns, we can also drop adj1/adj2/adj3):

deairfor 26 2 2 NA NA 2 10 3 30 

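I have tried countless ways to move these cells to the right, but I keep getting errors such as the one below (the adj* columns are character after the separation, hence the "1"):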

Df2$adj3[Df2$adj1 == "1"] <- Df2$adj2 
Df2$adj3value[Df2$adj1 == "1"] <- Df2$adj2value 
"NAs are not allowed in subscripted assignments" 

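The error quoted above comes from the logical index: comparing an NA with a string yields NA, and R refuses an assignment through an index containing NA when the replacement has more than one element. A minimal sketch with a hypothetical toy data frame (`toy`, not the real Df2), using `which()` to drop the NA positions and subsetting the right-hand side with the same index:

```r
# Hypothetical toy data; column names mirror the question, values are made up.
toy <- data.frame(adj1 = c(NA, "1", "2"),
                  adj1value = c(NA, "35", "10"),
                  stringsAsFactors = FALSE)
toy$adj2value <- NA_character_

mask <- toy$adj1 == "2"   # NA FALSE TRUE: the NA is what triggers the error

# Reproduce the failure without stopping the script.
msg <- tryCatch({
  toy$adj2value[mask] <- toy$adj1value
  "no error"
}, error = function(e) conditionMessage(e))

# which() keeps only the positions that are TRUE, so NAs are dropped,
# and the right-hand side is subset with the same index.
idx <- which(toy$adj1 == "2")
toy$adj2value[idx] <- toy$adj1value[idx]
toy$adj1value[idx] <- NA
```

This only illustrates the indexing rule; shifting every cycle into its own column this way would still need one such step per column.
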
So the question: how do I move these values into the proper columns?

"TOE" "Veh" "c0" "cur" "adj" 
deairfor 26 2 2 2:10,3:30 

should become

"TOE" "Veh" "c0" "cur" "adj1" "adj2" "adj3" 
deairfor 26 2 2 NA 10 30 

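Bonus question: I have the feeling that I need this many lines because the XML import at the start is not optimal. Is there a better way to do this, given my goal?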

+0

Try some of what this post uses to create a data frame from XML and see if it works for you: http://stackoverflow.com/questions/17198658/how-to-parse-xml-to-r-data-frame – sconfluentus

+0

Curiously, the XML you posted does not match the URL, as the web page has no *adj* attributes. – Parfait

+0

Yes, the page is updated over time; unfortunately adj will only show up again in about two weeks. –

Answers

1

I would write a function that adds NAs to the adj string, and then use tidyr's separate:

add_NAs <- function(x, n=3){ 
    y <- strsplit(x, ",") 
    sapply(y, function(z){ 
     n <- match(1:n, substr(z,1,1)) 
     paste(substring(z, 3)[n], collapse=",") 
    }) 
} 
add_NAs(c(NA, "1:35,2:20", "2:10,3:30", "1:20,3:5")) 
[1] "NA,NA,NA" "35,20,NA" "NA,10,30" "20,NA,5" 
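
Note that `substr(z, 1, 1)` and `substring(z, 3)` assume single-digit cycle numbers. Should a cycle ever reach 10 or more, a variant that splits each entry on the colon would be needed. The following is a sketch under that assumption; `add_NAs2` is a hypothetical name, not part of the answer above.

```r
# Variant of add_NAs that parses "cycle:capacity" pairs by splitting on ":",
# so multi-digit cycle numbers are handled as well.
add_NAs2 <- function(x, n = 3) {
  vapply(strsplit(x, ","), function(z) {
    parts <- strsplit(z, ":", fixed = TRUE)
    cycle <- vapply(parts, function(p) p[1], character(1))
    value <- vapply(parts, function(p) p[2], character(1))
    # Position i receives the value for cycle i, or NA if that cycle is absent.
    paste(value[match(as.character(seq_len(n)), cycle)], collapse = ",")
  }, character(1))
}

res <- add_NAs2(c(NA, "1:35,2:20", "2:10,3:30", "1:20,3:5"))
wide <- add_NAs2("10:5", n = 10)
```

For the four inputs used above it returns the same strings as `add_NAs`, so it can be dropped into the same `separate()` call.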

You can also use the xmlAttrsToDataFrame function to parse the attributes.

x <- XML:::xmlAttrsToDataFrame(doc["//r[@toe]"], stringsAsFactors=FALSE) 
x$adj <- add_NAs(x$adj) 
separate(x, adj, c("adj1", "adj2", "adj3"), sep="," , convert=TRUE) 
     toe veh c0 cur adj1 adj2 adj3 
1 deairfor 22 30 30 NA NA NA 
2 deairfor 23 40 20 35 20 NA 
3 deairfor 26 2 2 NA 10 30 
+0

Never knew about the 'xmlAttrsToDataFrame' method. Adding it to my library! – Parfait

0

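Thanks Chris for the help, that really answered all my questions! The final code is shown below for anyone who is interested.

I only needed to insert a line to download the XML file first, otherwise it would not pick it up. I used this thread: (https://stackoverflow.com/questions/24139221/reading-and-understanding-xml-in-r). In addition, for this table I wanted the level to "carry on" after an adjustment, which is what the five similar lines at the end do: if c0 = 10, adj1 = 20 and adj2 = NA, then adj2/Tier2 = 20.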

library(XML) 
library(tidyr) 
add_NAs <- function(x, n=5){ 
    y <- strsplit(x, ",") 
    sapply(y, function(z){ 
    n <- match(1:n, substr(z,1,1)) 
    paste(substring(z, 3)[n], collapse=",") 
    }) 
} 

fileURL <- "http://wiretap.wwiionline.com/xml/toes.sheet.xml" 
download.file(fileURL, destfile=tf <- tempfile(fileext=".xml")) 
doc <- xmlParse(tf) 
Test <- XML:::xmlAttrsToDataFrame(doc["//r[@toe]"], stringsAsFactors=FALSE) 
Test$adj <- add_NAs(Test$adj) 
Test <- separate(data = Test, col = adj, into = c("Tier1","Tier2","Tier3","Tier4","Tier5"), sep = "\\,") 
Test$Tier1 <- ifelse(Test$Tier1=="NA",Test$c0,Test$Tier1) 
Test$Tier2 <- ifelse(Test$Tier2=="NA",Test$Tier1,Test$Tier2) 
Test$Tier3 <- ifelse(Test$Tier3=="NA",Test$Tier2,Test$Tier3) 
Test$Tier4 <- ifelse(Test$Tier4=="NA",Test$Tier3,Test$Tier4) 
Test$Tier5 <- ifelse(Test$Tier5=="NA",Test$Tier4,Test$Tier5) 
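
The five `ifelse` lines all apply the same rule, so they could be folded into a loop. A minimal sketch, assuming the tier columns hold the string "NA" (or a real NA) where no adjustment exists; `carry_forward` and the demo data frame are hypothetical and not part of the code above.

```r
# Fill each missing tier with the value of the previous tier,
# starting from the base column (c0 in the code above).
carry_forward <- function(df, base_col, tier_cols) {
  prev <- df[[base_col]]
  for (col in tier_cols) {
    # TRUE where the tier is a real NA or the literal string "NA".
    missing <- is.na(df[[col]]) | df[[col]] == "NA"
    df[[col]][missing] <- prev[missing]
    prev <- df[[col]]
  }
  df
}

# Made-up demo rows in the same shape as Test after separate().
demo <- data.frame(c0 = c("10", "30"),
                   Tier1 = c("20", "NA"),
                   Tier2 = c("NA", "NA"),
                   Tier3 = c("NA", "5"),
                   stringsAsFactors = FALSE)
filled <- carry_forward(demo, "c0", c("Tier1", "Tier2", "Tier3"))
```

With the real data the call would presumably be `carry_forward(Test, "c0", paste0("Tier", 1:5))`.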
+0

I was looking for a way to do that. Checked again and found the button, thanks! –
