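Optimizing matching in R

Hoping someone can help. I have a lot of ortholog mapping to do in R, and it is proving very time-consuming. I have posted an example structure below. The obvious approaches, such as iterating row by row (for (i in 1:nrow(df))) with string splitting, or using sapply, have already been tried and are very slow. So I am hoping for a vectorized option.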

options(stringsAsFactors = FALSE)

# example accession mapping 
map <- data.frame(source = c("1", "2 4", "3", "4 6 8", "9"), 
        target = c("a b", "c", "d e f", "g", "h i")) 

# example protein list 
df <- data.frame(sourceIDs = c("1 2", "3", "4", "5", "8 9")) 

# now, map df$sourceIDs to map$target 


# expected output 
> matches 
[1] "a b c" "d e f" "g"     ""      "g h i"

I appreciate any help!

2017-06-16 user8173495

In most cases, the best way to solve this kind of problem is to build a data frame with one observation per row.

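# split each column on spaces (requires character columns, not factors)
map_split <- lapply(map, strsplit, split = ' ')
# pair every source ID in a row with every target in the same row
long_mappings <- mapply(expand.grid, map_split$source, map_split$target, SIMPLIFY = FALSE)
# stack the per-row pairings into one long data frame
all_map <- do.call(rbind, long_mappings)
names(all_map) <- c('source', 'target')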

Now all_map looks like this:

   source target
1       1      a
2       1      b
3       2      c
4       4      c
5       3      d
6       3      e
7       3      f
8       4      g
9       6      g
10      8      g
11      9      h
12      9      i

Do the same for df...

sourceIDs_split <- strsplit(df$sourceIDs, ' ') 
df_long <- data.frame(
    index = rep(seq_along(sourceIDs_split), lengths(sourceIDs_split)), 
    source = unlist(sourceIDs_split) 
)

This gives us df_long:

  index source
1     1      1
2     1      2
3     2      3
4     3      4
5     4      5
6     5      8
7     5      9

Now the two just need to be merged and collapsed.

matches <- merge(df_long, all_map, by = 'source', all.x = TRUE)
tapply(
    matches$target,
    matches$index,
    function(x) {
        paste0(sort(x), collapse = ' ')
    }
)

#       1       2       3       4       5
# "a b c" "d e f"   "c g"      "" "g h i"
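
The steps above can be wrapped into one function so that the result comes back as a plain character vector aligned with the rows of df. The following is only a sketch under stated assumptions: the helper name map_ids and its arguments are hypothetical (they do not appear in the answer), the input columns are assumed to be character, and rep() is used in place of expand.grid to build the same source/target pairs. It also de-duplicates targets with unique(), which the answer does not do.

```r
# Example data from the question; character columns are assumed.
map <- data.frame(source = c("1", "2 4", "3", "4 6 8", "9"),
                  target = c("a b", "c", "d e f", "g", "h i"),
                  stringsAsFactors = FALSE)
df <- data.frame(sourceIDs = c("1 2", "3", "4", "5", "8 9"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: the name and arguments are illustrative only.
map_ids <- function(ids, map_source, map_target, sep = " ") {
  src <- strsplit(map_source, sep, fixed = TRUE)
  tgt <- strsplit(map_target, sep, fixed = TRUE)

  # Long mapping: each source ID in a row paired with each target in that row
  # (the same pairs expand.grid would produce, built with rep()).
  long_map <- data.frame(
    source = unlist(Map(function(s, g) rep(s, times = length(g)), src, tgt)),
    target = unlist(Map(function(s, g) rep(g, each = length(s)), src, tgt)),
    stringsAsFactors = FALSE
  )

  # Long form of the query IDs, remembering the row each ID came from.
  id_list <- strsplit(ids, sep, fixed = TRUE)
  long_ids <- data.frame(
    index = rep(seq_along(id_list), lengths(id_list)),
    source = unlist(id_list),
    stringsAsFactors = FALSE
  )

  # Inner join, then collapse the unique targets per original row.
  merged <- merge(long_ids, long_map, by = "source")
  collapsed <- tapply(merged$target, merged$index,
                      function(x) paste(sort(unique(x)), collapse = sep))

  # Rows without any match keep the empty string.
  out <- character(length(id_list))
  out[as.integer(names(collapsed))] <- as.vector(collapsed)
  out
}

df$matches <- map_ids(df$sourceIDs, map$source, map$target)
```

As in the answer's output, the row with ID 4 yields "c g" rather than "g", because 4 appears in two rows of map ("2 4" and "4 6 8").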

2017-06-16 19:58:36

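`lapply(map, strsplit, split = ' ')` gives me an error. Does this work for you? – CPak

This is a great solution. Thank you. – user8173495

@ChiPak I assumed `options(stringsAsFactors = FALSE)` from the original example. If the columns of `map` are factors, my solution will not work. –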
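
If the columns of map did end up as factors, strsplit fails with a "non-character argument" error, which is consistent with the error reported above. One possible workaround, shown here as a minimal sketch rather than as part of the original answer, is to convert the columns to character before splitting. The data frame is created with stringsAsFactors = TRUE purely to reproduce the factor case.

```r
# Reproduce the factor case explicitly (assumption: this is what caused the error).
map <- data.frame(source = c("1", "2 4", "3", "4 6 8", "9"),
                  target = c("a b", "c", "d e f", "g", "h i"),
                  stringsAsFactors = TRUE)

# Convert every column to character; assigning into map[] keeps the data frame.
map[] <- lapply(map, as.character)

# The first step of the answer now works as intended.
map_split <- lapply(map, strsplit, split = " ")
```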
