R：根據自定義距離函數和多個條件快速匹配記錄

我在R中進行了一些功能，將化學質譜（具有兩列整數質量和強度的矩陣）與基於自定義光譜相似性函數和化合物的所謂保留指數（即洗脫時間）的匹配（例如，參見http://webbook.nist.gov/cgi/cbook.cgi?ID=C630035&Mask=200）。爲此，每個記錄的列表元素「RI」必須與庫中的列表元素進行比較，並且當絕對偏差小於給定容差時，它應該將最佳譜庫匹配添加到我的記錄中。下面是我寫的一些代碼，但是問題在於它對於我的目的來說太慢了（我通常有大約1000個樣本譜和200 000個譜庫譜）。我嘗試了並行化，但這似乎也沒有多大幫助。我想如何讓代碼更高效，例如使用更多的矢量化，或使用內聯C代碼或其他R技巧？我知道這方面的一般建議，但在這種情況下並沒有看到如何輕鬆實現它（而且我不熟悉C，但不幸）......任何想法或建議？哦，是的，如何在使用sfLapply時添加進度條？它可能有助於將我的光譜放在一個很大的（稀疏的，因爲有很多零）矩陣中，以避免光譜相似度函數中的步驟，或者使用其他標準，例如僅考慮最大/查詢譜中最強烈的峯具有與譜庫譜相同的質量（或者包含在譜庫譜中5個最大峯的集合中）？無論如何，任何想法如何加快這項任務將不勝感激！R：根據自定義距離函數和多個條件快速匹配記錄

編輯：我仍然有一個殘留查詢是我如何避免使函數addbestlibmatches1中的示例記錄recs的完整副本recs，而是更改只有記錄的地方有一個庫匹配？另外，傳遞存有保留索引匹配的庫記錄的選擇可能效率不高（在函數addbestlibmatch中）。任何想法如何我可以避免這一點？

# EXAMPLE DATA 

rec1=list(RI=1100,spectrum=as.matrix(cbind(mz=c(57,43,71,41,85,56,55,70,42,84,98,99,39,69,58,113,156),int=c(999,684,396,281,249,173,122,107,94,73,51,48,47,47,37,33,32)))) 
randrec=function() list(RI=runif(1,1000,1200),spectrum=as.matrix(cbind(mz=seq(30,600,1),int=round(runif(600-30+1,0,999))))) 

# spectral library 
libsize=2000 # my real lib has 200 000 recs 
lib=lapply(1:libsize,function(i) randrec()) 
lib=append(list(rec1),lib) 

# sample spectra 
ssize=100 # I usually have around 1000 
s=lapply(1:ssize,function(i) randrec()) 
s=append(s,list(rec1)) # we add the first library record as the last sample spectrum, so this should match 



# SPECTRAL SIMILARITY FUNCTION 

SpecSim=function (ms1,ms2,log=F) { 
    alignment = merge(ms1,ms2,by=1,all=T) 
    alignment[is.na(alignment)]=0 
    if (nrow(alignment)!=0) { 
    alignment[,2]=100*alignment[,2]/max(alignment[,2]) # normalize base peak intensities to 100 
    alignment[,3]=100*alignment[,3]/max(alignment[,3]) 
    if (log==T) {alignment[,2]=log2(alignment[,2]+1);alignment[,3]=log2(alignment[,3]+1)} # work on log2 intensity scale if requested 
    u = alignment[,2]; v = alignment[,3] 
    similarity_score = as.vector((u %*% v)/(sqrt(sum(u^2)) * sqrt(sum(v^2)))) 
    similarity_score[is.na(similarity_score)]=-1 
    return(similarity_score)} else return(-1) } 



# FUNCTION TO CALCULATE SIMILARITY VECTOR OF SPECTRUM WITH LIBRARY SPECTRA 

SpecSimLib=function(spec,lib,log=F) { 
    sapply(1:length(lib), function(i) SpecSim(spec,lib[[i]]$spectrum,log=log)) } 



# FUNCTION TO ADD BEST MATCH OF SAMPLE RECORD rec IN SPECTRAL LIBRARY lib TO ORIGINAL RECORD 
# we only compare spectra when list element RI in the sample record is within tol of that in the library 
# when there is a spectral match > specsimcut within a RI devation less than tol, 
# we add the record nrs in the library with the best spectral matches, the spectral similarity and the RI deviation to recs 

addbestlibmatch=function(rec,lib,xvar="RI",tol=10,log=F,specsimcut=0.8) { 
    nohit=list(record=-1,specmatch=NA,RIdev=NA) 
    selected=abs(sapply(lib, "[[", xvar)-rec[[xvar]])<tol 
    if (sum(selected)!=0) { 
    specsims=SpecSimLib(rec$spectrum,lib[selected],log) # HOW CAN I AVOID PASSING THE RIGHT LIBRARY SUBSET EACH TIME? 
    maxspecsim=max(specsims) 
    if (maxspecsim>specsimcut) {besthsel=which(specsims==maxspecsim)[[1]] # nr of best hit among selected elements, in case of ties we just take the 1st hit 
           idbesth=which(selected)[[besthsel]] # record nr of best hit in library lib 
           return(modifyList(rec,list(record=idbesth,specsim=specsims[[besthsel]],RIdev=rec[[xvar]]-lib[[idbesth]][[xvar]])))} 
           else {return(rec)} } else {return(rec)} 
} 



# FUNCTION TO ADD BEST LIBRARY MATCHES TO RECORDS RECS 

library(pbapply) 
addbestlibmatches1=function(recs,lib,xvar="RI",tol=10,log=F,specsimcut=0.8) { 
    pblapply(1:length(recs), function(i) addbestlibmatch(recs[[i]],lib,xvar,tol,log,specsimcut)) 
} 

# PARALLELIZED VERSION 
library(snowfall) 
addbestlibmatches2=function(recs,lib,xvar="RI",tol=10,log=F,specsimcut=0.8,cores=4) { 
    sfInit(parallel=TRUE,cpus=cores,type="SOCK") 
    sfExportAll() 
    sfLapply(1:length(recs), function(i) addbestlibmatch(recs[[i]],lib,xvar,tol,log,specsimcut)) 
    sfStop() 
} 



# TEST TIMINGS 

system.time(addbestlibmatches1(s,lib)) 
#|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% 
#user system elapsed 
#83.60 0.06 83.82 

system.time(addbestlibmatches2(s,lib)) 
#user system elapsed - a bit better, but not much 
#2.59 0.74 42.37

來源

2014-07-07 Tom Wenseleers

簡單的進度條：'pb < - txtProgressBar（0，100，style = 3）; sapply（1：100，function（i）{setTxtProgressBar（pb，i）; Sys.sleep（0.05）; i}）;關閉（pb）' – tonytonov

關於性能，您可能需要更深入的分析才能確定瓶頸。您可以從Advanced R的[本章]（http://adv-r.had.co.nz/Profiling.html）開始。 – tonytonov

Thx - 剛試過用devtools :: install_github（「hadley/lineprof」）庫（lineprof） l = lineprof（addbestlibmatches1（s，lib）） shine（l），但只是得到一個空屏幕和一個警告文件中的警告（con，「r」）： file（「」）只支持open =「w +」和open =「w + b」：使用前者有什麼想法？我認爲加速我的代碼的主要關鍵是矢量化和避免複製的東西，但不知道如何在這種情況下做到這一點... –

沒有看所有詳細的代碼，我覺得有空間在SpecSim功能的改善沒有去所有的C++呢。您正在使用合併，這會將您的矩陣強制爲data.frames。這對性能來說總是很糟糕。大部分代碼時間可能在merge（）和子集中。數據框架子集化速度慢，矩陣或矢量速度快。

SpecSim2 <- function (ms1,ms2,log=F) { 
    i <- unique(c(ms1[,1], ms2[,1])) 
    y <- x <- numeric(length(i)) 
    x[match(ms1[,1], i)] <- ms1[, 2] 
    y[match(ms2[,1], i)] <- ms2[, 2] 
    x <- 100 * x/max(x) 
    y <- 100 * y/max(y) 
    if (log){ 
    x <- log2(x + 1) 
    y <- log2(y + 1) 
    } 
    similarity.score <- x %*% y/(sqrt(sum(x^2)) * sqrt(sum(y^2))) 
    if (is.na(similarity.score)){ 
    return(-1) 
    } else { 
    return(similarity.score) 
    } 
}

下面是重寫和原始的，分別計時部分：

> system.time(addbestlibmatches1(s,lib)) 
    |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% 
    user system elapsed 
    4.16 0.00 4.17 

> system.time(addbestlibmatches1(s,lib)) 
    |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% 
    user system elapsed 
    34.25 0.02 34.34

可能不會得到你所需要的速度，但比8倍的改善...

至於如何處理addbestlibmatches（），我認爲你需要重新考慮這個問題是一個矩陣問題。不是使用列表來保存庫，而是使用矢量作爲庫和樣本的RI值。然後可以執行初始屏幕是這樣的：

selected <- outer(sRI, libRI, FUN = '-') < 10

如果庫是一個大的矩陣（羣衆X光譜），然後就可以子集選擇的光譜庫並計算樣品和整個部分之間的距離這是在一次通過選擇這樣的庫：

SpecSimMat <- function(x, lib, log = F){ 
    stopifnot(require(matrixStats)) 
    x <- 100 * x/max(x) 
    lib <- sweep(lib, 2, colMaxs(lib)) 
    x %*% lib/(sqrt(sum(x^2)) * sqrt(colSums(lib^2)))

}

其中x是一個樣品和IIb是所選擇的光譜（羣衆X光譜）的基質中。通過這種方式，您可以對矩陣進行子集（快速），然後執行一系列矩陣運算（快速）。

來源

2014-07-07 17:50:25 oropendola

哈，真是太棒了 - 非常感謝！大部分時間都花在計算相似性分數上，所以我可以專注於加速那部分...... –

哈哈，還有一個我仍然有的殘留查詢是我如何避免做一個完整的副本樣本記錄會在函數addbestlibmatches1中記錄，但是隻會更改存在庫匹配的記錄嗎？另外，傳遞保留索引匹配的庫記錄選擇副本可能效率不高（在函數addbestlibmatch中）。任何想法如何我可以避免這一點？ –

R：根據自定義距離函數和多個條件快速匹配記錄

回答

相關問題