2014-02-12

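Average signal over several intervals in R

I have a genome-wide ChIP-seq signal imported from a bedGraph file into a GRanges object. I want to plot the average signal over fixed-width intervals that cover all the peaks. How can I extract the signal into numeric vectors so that I can average them?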

By way of example, consider:

library(GenomicRanges) 
set.seed(1) 

signal <- GRanges(
    seqnames = Rle(c("chr1"), c(10)), 
    ranges = IRanges(1:10*10, end = 1:10*10+5), 
    score = runif(10)) 

intervals <- GRanges(
    seqnames = Rle(c("chr1"), c(5)), 
    ranges = IRanges(1:5*20 + floor(runif(5)*4), width = 10)) 

So the signal looks like:

GRanges with 10 ranges and 1 metadata column:
       seqnames     ranges strand |              score
          <Rle>  <IRanges>  <Rle> |          <numeric>
   [1]     chr1 [ 10,  15]      * |    0.2655086631421
   [2]     chr1 [ 20,  25]      * |   0.37212389963679
   [3]     chr1 [ 30,  35]      * |  0.572853363351896
   [4]     chr1 [ 40,  45]      * |  0.908207789994776
   [5]     chr1 [ 50,  55]      * |  0.201681931037456
   [6]     chr1 [ 60,  65]      * |  0.898389684967697
   [7]     chr1 [ 70,  75]      * |  0.944675268605351
   [8]     chr1 [ 80,  85]      * |  0.660797792486846
   [9]     chr1 [ 90,  95]      * |   0.62911404389888
  [10]     chr1 [100, 105]      * | 0.0617862704675645
  ---
  seqlengths:
   chr1
     NA

and the intervals look like:

GRanges with 5 ranges and 0 metadata columns:
      seqnames     ranges strand
         <Rle>  <IRanges>  <Rle>
  [1]     chr1 [ 20,  29]      *
  [2]     chr1 [ 40,  49]      *
  [3]     chr1 [ 62,  71]      *
  [4]     chr1 [ 81,  90]      *
  [5]     chr1 [103, 112]      *
  ---
  seqlengths:
   chr1
     NA

So the vectors I want to average are:

Rle(c(0.372, 0), c(6, 4))   # [ 20, 29] 
Rle(c(0.908, 0), c(6, 4))   # [ 40, 49] 
Rle(c(0.898, 0, 0.945), c(4, 4, 2)) # [ 62, 71] 
Rle(c(0.661, 0, 0.629), c(5, 4, 1)) # [ 81, 90] 
Rle(c(0.062, 0), c(3, 7))   # [103,112] 

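How can I do this without for loops and a lot of tedious, error-prone interval arithmetic? I had hoped the GenomicRanges package would include this kind of functionality, but I can't see it in the manual. I have been trying subsetByOverlaps, but that does not seem to carry the signal scores into the result, nor does it help extract the Rle vectors above.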

Answers

2

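I think I may have figured it out. I can apply the following getScores() function to each range in intervals. The function uses findOverlaps, adapted from this answer: https://stackoverflow.com/a/9913411/959926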

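# Per-base scores for a single interval, as an Rle (0 where there is no signal).
getScores <- function(interval) {
    scores <- Rle(0, width(interval))
    # One width-1 range for every base in the interval.
    bases <- GRanges(
        seqnames = seqnames(interval),
        ranges = IRanges(start(interval):end(interval), width = 1))
    overlaps <- findOverlaps(signal, bases)
    # Convert genomic positions to 1-based offsets within the interval.
    offsets <- start(bases)[subjectHits(overlaps)] - start(interval) + 1
    scores[offsets] <- score(signal)[queryHits(overlaps)]
    scores
}
# Average the per-interval profiles position by position.
Reduce(`+`, lapply(split(intervals, seq_along(intervals)), getScores)) / length(intervals)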

It seems to work so far, but any improvements would be welcome. For example, it is very slow when the signal and/or the intervals are long.
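One possible way around the slowness, offered as a sketch rather than a definitive solution, is to build a per-base signal once with coverage(..., weight = "score") and then slice it at each interval. The example below rebuilds the toy data from the question. The chromosome length (chromEnd) and the names cov, profiles and meanProfile are assumptions introduced for illustration; real data should carry true seqlengths. It also assumes the signal ranges do not overlap one another, as is usual for bedGraph, because coverage() sums the weights of overlapping ranges.

```r
library(GenomicRanges)

# Same toy data as in the question.
set.seed(1)
signal <- GRanges(
    seqnames = "chr1",
    ranges = IRanges(1:10 * 10, end = 1:10 * 10 + 5),
    score = runif(10))
intervals <- GRanges(
    seqnames = "chr1",
    ranges = IRanges(1:5 * 20 + floor(runif(5) * 4), width = 10))

# Assumption: the chromosome length is unknown here, so use the furthest
# coordinate touched by either object; real data should use true seqlengths.
chromEnd <- max(end(signal), end(intervals))
seqlengths(signal) <- c(chr1 = chromEnd)

# Per-base signal as one Rle per chromosome; bases without signal get 0.
cov <- coverage(signal, weight = "score")

# One column per interval, one row per position within the interval.
profiles <- sapply(seq_along(intervals), function(i) {
    chrom <- as.character(seqnames(intervals))[i]
    as.numeric(cov[[chrom]][start(intervals)[i]:end(intervals)[i]])
})

# Average across intervals at each position.
meanProfile <- rowMeans(profiles)
```

Because every interval has the same width, sapply() returns a plain numeric matrix, so meanProfile can be passed straight to plot(). The only per-interval work left is slicing an Rle, and findOverlaps is no longer called once per interval.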

0
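# Mean score of the signal ranges that overlap each interval (NA if none overlap).
overlaps <- findOverlaps(signal, intervals)
means <- tapply(score(signal)[queryHits(overlaps)], subjectHits(overlaps), mean)
averagedSignal <- rep(NA_real_, length(intervals))
averagedSignal[as.integer(names(means))] <- means
intervals$averagedSignal <- averagedSignal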
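Note that the snippet above gives one number per interval, computed from the scores of the overlapping signal ranges regardless of how many bases each one covers. If a per-base mean over each interval is wanted instead, with uncovered bases counted as 0, one option is Views() together with viewMeans() from IRanges. This is a sketch on the question's toy data; the chromosome length and the name perIntervalMean are assumptions made for illustration.

```r
library(GenomicRanges)

# Same toy data as in the question.
set.seed(1)
signal <- GRanges(
    seqnames = "chr1",
    ranges = IRanges(1:10 * 10, end = 1:10 * 10 + 5),
    score = runif(10))
intervals <- GRanges(
    seqnames = "chr1",
    ranges = IRanges(1:5 * 20 + floor(runif(5) * 4), width = 10))

# Assumption: pad the chromosome so that every interval lies inside it.
seqlengths(signal) <- c(chr1 = max(end(signal), end(intervals)))

# Per-base signal for chr1; bases without signal get 0.
chr1Signal <- coverage(signal, weight = "score")[["chr1"]]

# One view per interval, then the mean of the per-base signal in each view.
perIntervalMean <- as.numeric(viewMeans(Views(chr1Signal, ranges(intervals))))
```

For data spanning several chromosomes, the same idea would have to be applied per chromosome, for example by splitting the intervals by seqnames first.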