2017-04-24

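I have 80,000 XML files that are all supposed to use the same format. That is clearly not the case, so I am trying to identify every node and child node that exists across the files. In short, I want to determine all possible parents and children in a list.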

I have imported the XML files as lists using the XML package, and I describe my input and my desired output below.

Input (a list of lists):

XML1 <- list(name = "Company Number 1", 
      adress = list(street = "JP Street", number = "12"), 
      product = "chicken") 

XML2 <- list(name = "Company Number 2", 
      company_adress = list(street = "House Street", number = "93"), 
      invoice = list(quantity = "2", product = "phone")) 

XML3 <- list(company_name = "Company Number 3", 
      adress = list(street = "Lake Street", number = "1"), 
      invoice = list(quantity = "2", product = "phone", list(note = "Phones are refurbished"))) 

Output (a tree structure merged across files, with the number of occurrences at the leaves):

List of 6
$ name   : num 2 
$ company_name : num 1 
$ adress  :List of 2 
    ..$ street: num 2 
    ..$ number: num 2 
$ company_adress:List of 2 
    ..$ street: num 1 
    ..$ number: num 1 
$ invoice  :List of 3 
    ..$ quantity: num 2 
    ..$ product : num 2 
    ..$   :List of 1 
    .. ..$ note: num 1 
$ product  : num 1 

Is there a package that does something along these lines, or do I need to write a function for this myself?
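For a quick first look before building the full tree, base R alone may be enough. The sketch below is only one possible approach, not something the XML package provides: it relies on unlist() joining nested names with "." and then tabulates the resulting leaf paths. The helper name leaf_paths is made up for this illustration, and the inputs are copied from above.

```r
# Example inputs copied from the question
XML1 <- list(name = "Company Number 1",
             adress = list(street = "JP Street", number = "12"),
             product = "chicken")
XML2 <- list(name = "Company Number 2",
             company_adress = list(street = "House Street", number = "93"),
             invoice = list(quantity = "2", product = "phone"))
XML3 <- list(company_name = "Company Number 3",
             adress = list(street = "Lake Street", number = "1"),
             invoice = list(quantity = "2", product = "phone",
                            list(note = "Phones are refurbished")))

# Hypothetical helper: unlist() flattens a nested list and joins the
# names of nested elements with ".", giving one path per leaf
leaf_paths <- function(x) names(unlist(x))

# Tabulate how often each leaf path occurs across all inputs
path_counts <- table(unlist(lapply(list(XML1, XML2, XML3), leaf_paths)))
print(path_counts)
```

This counts leaf paths only, so it gives a flat overview rather than the nested output shown above, and the paths become ambiguous if node names themselves contain a ".".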

Answer

I wrote a recursive function that solves the problem. It is not elegant, but it does the trick.

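The function takes a nested list and an empty vector. The vector tracks the current position in the tree during recursion, and the counts are accumulated in the global variable summary_tree.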

# Summary tree for storing results 
summary_tree <- list() 

# Function 
tree_merger <- function(tree, position) {
  # Stop at a leaf: in these files every leaf is a string or NULL
  if (is.character(tree) || is.null(tree)) {
    return(invisible(NULL))
  }

  # Names of the child nodes
  tree_names <- names(tree)

  if (length(position) == 0) {
    # Top level: index summary_tree directly
    for (i in seq_along(tree_names)) {
      if (is.null(summary_tree[[tree_names[i]]])) {
        summary_tree[[tree_names[i]]] <<- list(1)
      } else {
        # Increment only the counter so children counted earlier are kept
        summary_tree[[tree_names[i]]][[1]] <<- summary_tree[[tree_names[i]]][[1]] + 1
      }

      # Running function on new tree
      tree_merger(tree[[tree_names[i]]], c(position, tree_names[i]))
    }
  } else {
    # Build the index expression for the current position in summary_tree
    position_string <- paste0("summary_tree",
                              paste0("[[\"", position, "\"]]", collapse = ""))

    for (i in seq_along(tree_names)) {
      position_string_full <- paste0(position_string, "[[\"", tree_names[i], "\"]]")

      # Adding to position
      if (is.null(eval(parse(text = position_string_full)))) {
        eval(parse(text = paste(position_string_full, "<<- list(1)")))
      } else {
        # Increment only the counter so children counted earlier are kept
        eval(parse(text = paste0(position_string_full, "[[1]] <<- ",
                                 position_string_full, "[[1]] + 1")))
      }

      # Running function on new tree
      tree_merger(tree[[tree_names[i]]], c(position, tree_names[i]))
    }
  }
}

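If anyone runs into the same problem, note that the code that exits the recursion will probably need to change. In my XML files, every "leaf" ends in a string or NULL. In other lists, the leaves may hold other types of values.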
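As one way to make the exit condition independent of the leaf type, the recursion can stop at anything that is not a list. The following is a minimal sketch of that idea, not the function from the answer: it avoids global state and eval(parse()) by passing an accumulator through the recursion. The names count_nodes, ".count" and "<unnamed>" are assumptions made up for this illustration, and a real node called ".count" would clash with the counter.

```r
# Example inputs copied from the question
XML1 <- list(name = "Company Number 1",
             adress = list(street = "JP Street", number = "12"),
             product = "chicken")
XML2 <- list(name = "Company Number 2",
             company_adress = list(street = "House Street", number = "93"),
             invoice = list(quantity = "2", product = "phone"))
XML3 <- list(company_name = "Company Number 3",
             adress = list(street = "Lake Street", number = "1"),
             invoice = list(quantity = "2", product = "phone",
                            list(note = "Phones are refurbished")))

# Hypothetical helper: merge one nested list into an accumulator.
# Each node in the accumulator is a list whose ".count" element holds
# the number of occurrences; its children sit next to the counter.
count_nodes <- function(tree, acc = list()) {
  # Anything that is not a list is treated as a leaf, whatever its type
  if (!is.list(tree)) return(acc)

  keys <- names(tree)
  if (is.null(keys)) keys <- rep("", length(tree))
  # Give unnamed children a placeholder key so they are still counted
  keys[is.na(keys) | keys == ""] <- "<unnamed>"

  for (i in seq_along(tree)) {
    node <- acc[[keys[i]]]
    if (is.null(node)) node <- list(.count = 0)
    node[[".count"]] <- node[[".count"]] + 1
    # Recurse into the child, carrying its existing counts along
    acc[[keys[i]]] <- count_nodes(tree[[i]], node)
  }
  acc
}

# Fold all inputs into one summary tree, starting from an empty list
merged <- Reduce(function(acc, x) count_nodes(x, acc),
                 list(XML1, XML2, XML3), list())
str(merged)
```

With 80,000 files, the same Reduce() call could run over the full list of imported files. Keeping the counter under a fixed name is a design choice for this sketch; a separate attribute would avoid the name clash at the cost of less readable str() output.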