我在R中進行了一些功能,將化學質譜(具有兩列整數質量和強度的矩陣)與基於自定義光譜相似性函數和化合物的所謂保留指數(即洗脫時間)的匹配(例如,參見http://webbook.nist.gov/cgi/cbook.cgi?ID=C630035&Mask=200)。爲此,每個記錄的列表元素「RI」必須與庫中的列表元素進行比較,並且當絕對偏差小於給定容差時,它應該將最佳譜庫匹配添加到我的記錄中。下面是我寫的一些代碼,但是問題在於它對於我的目的來說太慢了(我通常有大約1000個樣本譜和200 000個譜庫譜)。我嘗試了並行化,但這似乎也沒有多大幫助。我想如何讓代碼更高效,例如使用更多的矢量化,或使用內聯C代碼或其他R技巧?我知道這方面的一般建議,但在這種情況下並沒有看到如何輕鬆實現它(而且我不熟悉C,但不幸)......任何想法或建議?哦,是的,如何在使用sfLapply
時添加進度條?它可能有助於將我的光譜放在一個很大的(稀疏的,因爲有很多零)矩陣中,以避免光譜相似度函數中的步驟,或者使用其他標準,例如僅考慮最大/查詢譜中最強烈的峯具有與譜庫譜相同的質量(或者包含在譜庫譜中5個最大峯的集合中)?無論如何,任何想法如何加快這項任務將不勝感激!R:根據自定義距離函數和多個條件快速匹配記錄
編輯:我仍然有一個殘留查詢是我如何避免使函數addbestlibmatches1中的示例記錄recs的完整副本recs,而是更改只有記錄的地方有一個庫匹配?另外,傳遞存有保留索引匹配的庫記錄的選擇可能效率不高(在函數addbestlibmatch中)。任何想法如何我可以避免這一點?
# EXAMPLE DATA
rec1=list(RI=1100,spectrum=as.matrix(cbind(mz=c(57,43,71,41,85,56,55,70,42,84,98,99,39,69,58,113,156),int=c(999,684,396,281,249,173,122,107,94,73,51,48,47,47,37,33,32))))
randrec=function() list(RI=runif(1,1000,1200),spectrum=as.matrix(cbind(mz=seq(30,600,1),int=round(runif(600-30+1,0,999)))))
# spectral library
libsize=2000 # my real lib has 200 000 recs
lib=lapply(1:libsize,function(i) randrec())
lib=append(list(rec1),lib)
# sample spectra
ssize=100 # I usually have around 1000
s=lapply(1:ssize,function(i) randrec())
s=append(s,list(rec1)) # we add the first library record as the last sample spectrum, so this should match
# SPECTRAL SIMILARITY FUNCTION
SpecSim=function (ms1,ms2,log=F) {
alignment = merge(ms1,ms2,by=1,all=T)
alignment[is.na(alignment)]=0
if (nrow(alignment)!=0) {
alignment[,2]=100*alignment[,2]/max(alignment[,2]) # normalize base peak intensities to 100
alignment[,3]=100*alignment[,3]/max(alignment[,3])
if (log==T) {alignment[,2]=log2(alignment[,2]+1);alignment[,3]=log2(alignment[,3]+1)} # work on log2 intensity scale if requested
u = alignment[,2]; v = alignment[,3]
similarity_score = as.vector((u %*% v)/(sqrt(sum(u^2)) * sqrt(sum(v^2))))
similarity_score[is.na(similarity_score)]=-1
return(similarity_score)} else return(-1) }
# FUNCTION TO CALCULATE SIMILARITY VECTOR OF SPECTRUM WITH LIBRARY SPECTRA
SpecSimLib=function(spec,lib,log=F) {
sapply(1:length(lib), function(i) SpecSim(spec,lib[[i]]$spectrum,log=log)) }
# FUNCTION TO ADD BEST MATCH OF SAMPLE RECORD rec IN SPECTRAL LIBRARY lib TO ORIGINAL RECORD
# we only compare spectra when list element RI in the sample record is within tol of that in the library
# when there is a spectral match > specsimcut within a RI devation less than tol,
# we add the record nrs in the library with the best spectral matches, the spectral similarity and the RI deviation to recs
addbestlibmatch=function(rec,lib,xvar="RI",tol=10,log=F,specsimcut=0.8) {
nohit=list(record=-1,specmatch=NA,RIdev=NA)
selected=abs(sapply(lib, "[[", xvar)-rec[[xvar]])<tol
if (sum(selected)!=0) {
specsims=SpecSimLib(rec$spectrum,lib[selected],log) # HOW CAN I AVOID PASSING THE RIGHT LIBRARY SUBSET EACH TIME?
maxspecsim=max(specsims)
if (maxspecsim>specsimcut) {besthsel=which(specsims==maxspecsim)[[1]] # nr of best hit among selected elements, in case of ties we just take the 1st hit
idbesth=which(selected)[[besthsel]] # record nr of best hit in library lib
return(modifyList(rec,list(record=idbesth,specsim=specsims[[besthsel]],RIdev=rec[[xvar]]-lib[[idbesth]][[xvar]])))}
else {return(rec)} } else {return(rec)}
}
# FUNCTION TO ADD BEST LIBRARY MATCHES TO RECORDS RECS
library(pbapply)
addbestlibmatches1=function(recs,lib,xvar="RI",tol=10,log=F,specsimcut=0.8) {
pblapply(1:length(recs), function(i) addbestlibmatch(recs[[i]],lib,xvar,tol,log,specsimcut))
}
# PARALLELIZED VERSION
library(snowfall)
addbestlibmatches2=function(recs,lib,xvar="RI",tol=10,log=F,specsimcut=0.8,cores=4) {
sfInit(parallel=TRUE,cpus=cores,type="SOCK")
sfExportAll()
sfLapply(1:length(recs), function(i) addbestlibmatch(recs[[i]],lib,xvar,tol,log,specsimcut))
sfStop()
}
# TEST TIMINGS
system.time(addbestlibmatches1(s,lib))
#|++++++++++++++++++++++++++++++++++++++++++++++++++| 100%
#user system elapsed
#83.60 0.06 83.82
system.time(addbestlibmatches2(s,lib))
#user system elapsed - a bit better, but not much
#2.59 0.74 42.37
簡單的進度條:'pb < - txtProgressBar(0,100,style = 3); sapply(1:100,function(i){setTxtProgressBar(pb,i); Sys.sleep(0.05); i});關閉(pb)' – tonytonov
關於性能,您可能需要更深入的分析才能確定瓶頸。您可以從Advanced R的[本章](http://adv-r.had.co.nz/Profiling.html)開始。 – tonytonov
Thx - 剛試過用devtools :: install_github(「hadley/lineprof」 ) 庫(lineprof) l = lineprof(addbestlibmatches1(s,lib)) shine(l),但只是得到一個空屏幕和一個警告文件中的警告(con,「r」): file(「」)只支持open =「w +」和open =「w + b」:使用前者 有什麼想法?我認爲加速我的代碼的主要關鍵是矢量化和避免複製的東西,但不知道如何在這種情況下做到這一點... –