2017-07-25 71 views

Extract data from a list of lists into its own `data.frame` with `purrr`

Representative sample data (list of lists):

l <- list(structure(list(a = -1.54676469632688, b = "s", c = "T", 
d = structure(list(id = 5L, label = "Utah", link = "Asia/Anadyr", 
    score = -0.21104594634643), .Names = c("id", "label", 
"link", "score")), e = 49.1279871269422), .Names = c("a", 
"b", "c", "d", "e")), structure(list(a = -0.934821052832427, 
b = "k", c = "T", d = list(structure(list(id = 8L, label = "South Carolina", 
    link = "Pacific/Wallis", score = 0.526540892113734, externalId = -6.74354377676955), .Names = c("id", 
"label", "link", "score", "externalId")), structure(list(
    id = 9L, label = "Nebraska", link = "America/Scoresbysund", 
    score = 0.250895465294041, externalId = 16.4257470807879), .Names = c("id", 
"label", "link", "score", "externalId"))), e = 52.3161400117052), .Names = c("a", 
"b", "c", "d", "e")), structure(list(a = -0.27261485993069, b = "f", 
c = "P", d = list(structure(list(id = 8L, label = "Georgia", 
    link = "America/Nome", score = 0.526494135483816, externalId = 7.91583574935589), .Names = c("id", 
"label", "link", "score", "externalId")), structure(list(
    id = 2L, label = "Washington", link = "America/Shiprock", 
    score = -0.555186440792989, externalId = 15.0686663219837), .Names = c("id", 
"label", "link", "score", "externalId")), structure(list(
    id = 6L, label = "North Dakota", link = "Universal", 
    score = 1.03168296038975), .Names = c("id", "label", 
"link", "score")), structure(list(id = 1L, label = "New Hampshire", 
    link = "America/Cordoba", score = 1.21582056168681, externalId = 9.7276418869132), .Names = c("id", 
"label", "link", "score", "externalId")), structure(list(
    id = 1L, label = "Alaska", link = "Asia/Istanbul", score = -0.23183264861979), .Names = c("id", 
"label", "link", "score")), structure(list(id = 4L, label = "Pennsylvania", 
    link = "Africa/Dar_es_Salaam", score = 0.590245339334121), .Names = c("id", 
"label", "link", "score"))), e = 132.1153538536), .Names = c("a", 
"b", "c", "d", "e")), structure(list(a = 0.202685974077313, b = "x", 
c = "O", d = structure(list(id = 3L, label = "Delaware", 
    link = "Asia/Samarkand", score = 0.695577130634724, externalId = 15.2364820698193), .Names = c("id", 
"label", "link", "score", "externalId")), e = 97.9908914452971), .Names = c("a", 
"b", "c", "d", "e")), structure(list(a = -0.396243444741009, 
b = "z", c = "P", d = list(structure(list(id = 4L, label = "North Dakota", 
    link = "America/Tortola", score = 1.03060272795705, externalId = -7.21666936522344), .Names = c("id", 
"label", "link", "score", "externalId")), structure(list(
    id = 9L, label = "Nebraska", link = "America/Ojinaga", 
    score = -1.11397997280413, externalId = -8.45145052697411), .Names = c("id", 
"label", "link", "score", "externalId"))), e = 123.597945533926), .Names = c("a", 
"b", "c", "d", "e"))) 

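I have a list of lists, obtained by downloading JSON data.

The list has 176 elements, each with 33 nested elements, some of which are themselves lists of varying lengths.

I am interested in analyzing the data contained in one particular nested list. In each of the 176 elements it holds either 4 or 5 fields (some have 4, some have 5). I want to extract this nested list and convert it to a data.frame so that I can run some analysis on it.

In the representative sample data above, I am interested in the nested list d in each of the 5 elements of l. The desired data.frame would therefore look like this: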

id          label           link      score externalId
 5           Utah    Asia/Anadyr -0.2110459         NA
 8 South Carolina Pacific/Wallis  0.5265409  -6.743544
.
.

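I have been trying to use purrr, which seems to offer a sensible and consistent workflow for processing data in lists, but I run into an error and cannot quite work out why. Most likely I do not properly understand the commands or logic of purrr, or of lists (possibly both). This is the code I have tried, together with the error it throws: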

df <- map_df(l, "d", ~as.data.frame(.)) 
Error: incompatible sizes (5 != 4) 

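I believe this is either because the components of d have different lengths or contain different data (sometimes 4 elements, sometimes 5), or because the function I am using here is mis-specified. In truth, I am not entirely sure.

I have worked around the problem with a for loop, which I know is inefficient, hence my question here.

This is the for loop I am currently using: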

df <- data.frame(id = integer(), label = character(), score = numeric(), externalId = numeric()) 
for(i in seq_along(l)){ 
    df_temp <- l[[i]][[4]] %>% map_df(~as.data.frame(.)) 
    df <- rbind(df, df_temp) 
} 

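Some help, ideally with purrr, or with some version of apply (since that would still be better than my for loop), would be much appreciated. Also, if there are resources on the above, I would like to understand the approach rather than just find the right code.

Answers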

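You can do this in three steps: first pull out d, then row-bind the elements within each d, and finally combine everything into a single object.

I used bind_rows from dplyr for the row-binding within each list; map_df does the final row-binding.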

library(purrr) 
library(dplyr) 

l %>% 
    map("d") %>% 
    map_df(bind_rows) 

This is also equivalent to:

map_df(l, ~bind_rows(.x[["d"]])) 

The result looks like this:

# A tibble: 12 x 5
      id          label                 link      score externalId
   <int>          <chr>                <chr>      <dbl>      <dbl>
 1     5           Utah          Asia/Anadyr -0.2110459         NA
 2     8 South Carolina       Pacific/Wallis  0.5265409  -6.743544
 3     9       Nebraska America/Scoresbysund  0.2508955  16.425747
 4     8        Georgia         America/Nome  0.5264941   7.915836
 5     2     Washington     America/Shiprock -0.5551864  15.068666
 6     6   North Dakota            Universal  1.0316830         NA
 7     1  New Hampshire      America/Cordoba  1.2158206   9.727642
 8     1         Alaska        Asia/Istanbul -0.2318326         NA
 9     4   Pennsylvania Africa/Dar_es_Salaam  0.5902453         NA
10     3       Delaware       Asia/Samarkand  0.6955771  15.236482
11     4   North Dakota      America/Tortola  1.0306027  -7.216669
12     9       Nebraska      America/Ojinaga -1.1139800  -8.451451
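
To see what each of the three steps contributes, here is a minimal sketch that runs them one at a time. It uses `l_small`, a hypothetical two-element stand-in for `l` with fewer fields (not the original data), and the intermediate names `step1`, `step2` and `result` are only for illustration.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for `l`: the first d is a single observation,
# the second d is a list of two observations.
l_small <- list(
  list(a = 1, d = list(id = 5L, label = "Utah", score = -0.21)),
  list(a = 2, d = list(
    list(id = 8L, label = "South Carolina", score = 0.53, externalId = -6.74),
    list(id = 9L, label = "Nebraska", score = 0.25, externalId = 16.43)
  ))
)

step1  <- map(l_small, "d")      # step 1: pull d out of each element
step2  <- map(step1, bind_rows)  # step 2: one tibble per element of l_small
result <- bind_rows(step2)       # step 3: stack the per-element tibbles

print(map_int(step2, nrow))      # rows contributed by each element
print(result)
```

Because bind_rows matches columns by name, the observation without externalId simply gets NA in that column.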

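For more information on purrr, I recommend "R for Data Science" by Grolemund and Wickham: http://r4ds.had.co.nz/

I think one problem you are facing is that some of the d items in l are lists of variables holding a single observation each, ready to be converted to a data frame, while other items are lists of such lists.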
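
That difference in shape can be checked directly. The sketch below is only an illustration in base R on `l_small`, a hypothetical two-element stand-in for `l` (not the original data). It relies on the observation that a single record is a named list, while a list of records has no names at the top level.

```r
# Hypothetical stand-in for `l`: the first d is a single observation,
# the second d is a list of two observations.
l_small <- list(
  list(a = 1, d = list(id = 5L, label = "Utah", score = -0.21)),
  list(a = 2, d = list(
    list(id = 8L, label = "South Carolina", score = 0.53, externalId = -6.74),
    list(id = 9L, label = "Nebraska", score = 0.25, externalId = 16.43)
  ))
)

# TRUE where d is a list of observations, FALSE where d is one observation.
d_is_nested <- vapply(l_small, function(x) is.null(names(x$d)), logical(1))

# Length of d: number of fields for a single observation,
# number of observations for a nested d.
d_lengths <- lengths(lapply(l_small, function(x) x$d))

print(d_is_nested)  # FALSE TRUE
print(d_lengths)    # 3 2
```

The lengths therefore mean different things from one element to the next, which is why the d items cannot be bound together directly without handling the two shapes.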

But I am not very good at purrr myself. Here is how I would do it:

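library(dplyr) ## provides %>% and bind_rows

l <- lapply(l, function(x){x$d}) ## work with the data you need. 

## elements of d that are a single observation (a named list)
list_of_observations <- Filter(function(x) {!is.null(names(x))},l) 

## elements of d that are unnamed lists of observations
list_of_lists <- Filter(function(x) {is.null(names(x))}, l) 

## flatten one level so that every element is a single observation
another_list_of_observations <- unlist(list_of_lists, recursive=FALSE) 

## note: single observations come first, so the row order differs from l
df <- lapply(c(list_of_observations, another_list_of_observations), 
      as.data.frame, stringsAsFactors = FALSE) %>% bind_rows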
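
If the original row order matters, one possible variation is to wrap each single observation in a list first, so that every d has the same shape before flattening. The following is a rough base R sketch on `l_small`, a hypothetical stand-in for `l`. The helper `as_record_list` and the NA-filling step are assumptions made for illustration, not part of the approach above.

```r
# Hypothetical stand-in for `l`: the first d is a single observation,
# the second d is a list of two observations.
l_small <- list(
  list(a = 1, d = list(id = 5L, label = "Utah", score = -0.21)),
  list(a = 2, d = list(
    list(id = 8L, label = "South Carolina", score = 0.53, externalId = -6.74),
    list(id = 9L, label = "Nebraska", score = 0.25, externalId = 16.43)
  ))
)

# Wrap a single observation in a list so every d is a list of observations.
as_record_list <- function(d) if (is.null(names(d))) d else list(d)

# Flatten one level: one list element per observation, in the order of l_small.
records <- unlist(lapply(l_small, function(x) as_record_list(x$d)),
                  recursive = FALSE)

# rbind() needs identical columns, so fill the fields a record lacks with NA.
all_cols <- unique(unlist(lapply(records, names)))
rows <- lapply(records, function(r) {
  r[setdiff(all_cols, names(r))] <- NA
  as.data.frame(r[all_cols], stringsAsFactors = FALSE)
})

df <- do.call(rbind, rows)
print(df)
```

With dplyr available, bind_rows does the NA filling itself, so the last two steps could be replaced by a single call on the per-record data frames.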