2017-05-05
0

How to categorize according to a list of named vectors (~ontology)

Simply put, I have a data frame in which each row contains an item type:

df <- data.frame(
    item = 1:5, 
    type = c("apple", "orange", "onion", "lettuce", "chicken") 
) 

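I want to classify each item into a higher-level category, where each category is defined by a list of its possible types. I know all the possible types (or can extract them with df$type %>% levels()).

1) How should I structure the "ontology" / "dictionary" that lists all possible values for each category? I thought of a list of named lists, but I don't know what the best way to do this is.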

ontology = c(
    "fruit" = c("apple", "orange", "banana"), 
    "vegetable" = c("onion", "lettuce", "tomato"), 
    "meat" = c("chicken", "beef") 
) 

2) How should I create a category variable in my data frame that classifies each type?

# Basic attempt... 
df %>% 
    mutate(category = str_match(type %in% ontology)) 

Expected result:

df 
# item    type  category 
#    1   apple     fruit 
#    2  orange     fruit 
#    3   onion vegetable 
#    4 lettuce vegetable 
#    5 chicken      meat 

Answers

2

Here is a base R method using match, unlist, and sub:

# flatten ontology list to named atomic vector where name is category with added digit 
flat <- unlist(ontology) 
# match position of df$type in flat ontology, pull out name, and remove numeric digit 
df$category <- sub("\\d+$", "", names(flat)[match(df$type, flat)]) 
df 
  item    type  category 
1    1   apple     fruit 
2    2  orange     fruit 
3    3   onion vegetable 
4    4 lettuce vegetable 
5    5 chicken      meat 
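
To see why the trailing digit has to be stripped, here is a minimal sketch. It assumes the ontology is stored as a named list; the names ontology_list and categories are introduced only for illustration and are not part of the question.

```r
# Hypothetical name: the same ontology as in the question, stored as a named list
ontology_list <- list(
    fruit = c("apple", "orange", "banana"),
    vegetable = c("onion", "lettuce", "tomato"),
    meat = c("chicken", "beef")
)

# unlist() builds element names from the list name plus a running index
flat <- unlist(ontology_list)
names(flat)

# removing the trailing digits recovers the category of each type
categories <- sub("\\d+$", "", names(flat))
```

Note that this approach assumes no category name itself ends in a digit, since sub() would strip that as well.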
1

You can turn ontology into a lookup table:

library(tidyverse) 

df <- data.frame(
    item = 1:5, 
    type = c("apple", "orange", "onion", "lettuce", "chicken") 
) 

lookup <- list( # use list to avoid suffixes on names 
    "fruit" = c("apple", "orange", "banana"), 
    "vegetable" = c("onion", "lettuce", "tomato"), 
    "meat" = c("chicken", "beef") 
) %>% 
    imap(~set_names(rep_along(.x, .y), .x)) %>% # reverse names and objects 
    flatten_chr() # simplify to character vector 

lookup 
#>       apple      orange      banana       onion     lettuce      tomato 
#>     "fruit"     "fruit"     "fruit" "vegetable" "vegetable" "vegetable" 
#>     chicken        beef 
#>      "meat"      "meat" 

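This makes classification just a matter of subsetting:

# as.character() ensures lookup by name even if type is a factor 
df %>% mutate(category = lookup[as.character(type)]) 
#>   item    type  category 
#> 1    1   apple     fruit 
#> 2    2  orange     fruit 
#> 3    3   onion vegetable 
#> 4    4 lettuce vegetable 
#> 5    5 chicken      meat 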
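
If the tidyverse is not available, the same lookup table can be sketched in base R. This is only one possible equivalent, not the answer's own code; ontology_list is a name assumed here for illustration.

```r
# Hypothetical name: the ontology as a named list (category -> types)
ontology_list <- list(
    fruit = c("apple", "orange", "banana"),
    vegetable = c("onion", "lettuce", "tomato"),
    meat = c("chicken", "beef")
)

# Repeat each category name once per type, then name each entry by its type
lookup <- setNames(
    rep(names(ontology_list), lengths(ontology_list)),
    unlist(ontology_list, use.names = FALSE)
)

df <- data.frame(
    item = 1:5,
    type = c("apple", "orange", "onion", "lettuce", "chicken"),
    stringsAsFactors = FALSE # keep type as character so lookup is by name
)

# Subset the lookup table by type; unname() drops the type names from the result
df$category <- unname(lookup[df$type])
```

Keeping type as a character vector matters here: indexing a named vector with a factor uses the factor's integer codes rather than its labels.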
+0

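I like the readability of your solution, especially the final 'lookup[type]'! But I can't use 'imap()' because 'library(tidyverse)' currently produces an error (due to an update with gcc 7 and the R dylib wanting to load libgfortran.3.dylib ... I'll wait for an update, I guess), and I can't find 'imap()' in 'purrr'. –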

+1

It's really equivalent to 'foo %>% map2(names(.), ~...)', which should work regardless of which version of purrr you have. – alistaire