2017-02-22 173 views
1

我工作的RNA序列數據,並試圖通過基因型繪製平均覆蓋率曲線,類似於什麼都是在這裏完成每基因型GenomicRanges添加覆蓋

RNA測序覆蓋率(來源:pickrell等,Nature,2010 )

enter image description here

要做到這一點的情節,我已經從100個個人要人文件,包含從RNA-seq的數據覆蓋信息(在特定區域),並且我在R讀取,如GenomicRanges對象。

這給了我GRANGES對象,如在下面的實施例玩具得到的那些:

GR1 = GRANGES(seqname的= 1,範圍= IRanges(開始= C(1,5,10,15,30 ,55),端= C(4,9,14,29,39,60)))

GR1 $ COV = C(3,1,8,6,2,10)

GR2 = GRANGES(seqname的= 1,範圍= IRanges(開始= C(3,20,24),結束= C(7,23,26)))

GR2 $ COV = C(3,5,3)

開始=唯一的(排序(C(範圍(GR1)@開始,範圍(GR2)@啓動)))

GR1

GRanges object with 6 ranges and 1 metadata column: 
seqnames ranges strand |  cov 
    <Rle> <IRanges> <Rle> | <numeric> 
     1 [ 1, 4]  * |   3 
     1 [ 5, 9]  * |   1 
     1 [10, 14]  * |   8 
     1 [15, 29]  * |   6 
     1 [30, 39]  * |   2 
     1 [55, 60]  * |  10 
     ------- 
seqinfo: 1 sequence from an unspecified genome; no seqlengths 

GR2

GRanges object with 3 ranges and 1 metadata column: 
seqnames ranges strand |  cov 
    <Rle> <IRanges> <Rle> | <numeric> 
     1 [ 3, 7]  * |   3 
     1 [20, 23]  * |   5 
     1 [24, 26]  * |   3 
     ------- 
seqinfo: 1 sequence from an unspecified genome; no seqlengths 

問題是,我有這些每個人(gr1和gr2將2個不同的個人),我想結合他們創造基因組範圍的對象,它給我的全覆蓋,在不同個體,1和2 會看,因爲每個位置如下:

GR3

GRanges object with 6 ranges and 1 metadata column: 
seqnames ranges strand |  cov 
    <Rle> <IRanges> <Rle> | <numeric> 
     1 [ 1, 2]  * |   3 
     1 [ 3, 4]  * |   6 (=3+3) 
     1 [ 5, 7]  * |   4 (=1+3) 
     1 [ 8, 9]  * |   1 
     1 [10, 14]  * |   8 
     1 [15, 19]  * |   6 
     1 [20, 23]  * |   11 (=6+5) 
     1 [24, 26]  * |   9 (=6+3) 
     1 [27, 29]  * |   6 
     1 [30, 39]  * |   2 
     1 [55, 60]  * |  10 

有誰知道一個簡單的方法這個 ?或者我註定了?

感謝您的回答。

PS: 我的數據並沒有擱淺,但如果你有它的滯留數據,它會更好。

PPS:理想情況下,我還希望能夠進行乘法運算,或者應用帶有兩個參數x和y的任何函數,而不是簡單地添加覆蓋範圍。

回答

0

已經差不多一年了,但這是我未來參考的答案。

每當我找不到一個函數來直接執行此任務時,我只需將GRanges對象展開爲單bp分辨率。這使我可以在元數據列上執行任何所需的操作,將其視爲簡單的data.frame列,因爲IRanges現在在兩個對象之間相匹配。

在這個問題上的特殊情況下,以下的作品。

### Sort seqlevels 
# (not necessary here, but in real world examples, 
# with multiple sequences, you will want to do this) 
gr1 <- sort(GenomeInfoDb::sortSeqlevels(gr1)) 
gr2 <- sort(GenomeInfoDb::sortSeqlevels(gr2)) 

### Add seqlengths 
# (this corresponds to the actual sequence lengths; 
# here we use the highest position between the two objects: 60) 
seqlengths(gr1) <- 60 

### Make 1-bp tiles covering the genome 
# (using either one of gr1 and gr2 as a reference) 
bins <- GenomicRanges::tileGenome(GenomeInfoDb::seqlengths(gr1), 
            tilewidth=1, 
            cut.last.tile.in.chrom=TRUE) 

### Get coverage signal as Rle object 
gr1_cov <- coverage(gr1, weight="cov") 
gr2_cov <- coverage(gr2, weight="cov") 

### Get average coverage in each bin 
# (since the bins are 1-bp wide, this just keeps the original coverage value) 
gr1_bins <- GenomicRanges::binnedAverage(bins, gr1_cov, "binned_cov") 
gr2_bins <- GenomicRanges::binnedAverage(bins, gr2_cov, "binned_cov") 

### Make final object: 
# We can now sum the values in the metadata columns 
# Addressing the PPS, you could do any other operation or apply a function 
gr3 <- gr1_bins 
gr3$binned_cov <- gr1_bins$binned_cov + gr2_bins$binned_cov 

這產生最終GRanges對象在單鹼基分辨率。

> gr3 

GRanges object with 60 ranges and 1 metadata column: 
    seqnames ranges strand | binned_cov 
     <Rle> <IRanges> <Rle> | <numeric> 
[1]  1 [1, 1]  * |   3 
[2]  1 [2, 2]  * |   3 
[3]  1 [3, 3]  * |   6 
[4]  1 [4, 4]  * |   6 
[5]  1 [5, 5]  * |   4 
...  ...  ... ... .  ... 
[56]  1 [56, 56]  * |   10 
[57]  1 [57, 57]  * |   10 
[58]  1 [58, 58]  * |   10 
[59]  1 [59, 59]  * |   10 
[60]  1 [60, 60]  * |   10 
------- 
seqinfo: 1 sequence from an unspecified genome 

壓縮它,並在問題得到確切gr3,我們可以做到以下幾點。

### Compress back to variable-width IRanges (by cov) 
gr3_Rle <- coverage(gr3, weight='binned_cov') 
gr3 <- as(gr3_Rle, "GRanges") 

### Drop 0-score rows 
gr3 <- gr3[gr3$score > 0] 

### Rename metadata column 
names(mcols(gr3)) <- 'cov' 

> gr3 

GRanges object with 11 ranges and 1 metadata column: 
     seqnames ranges strand |  cov 
      <Rle> <IRanges> <Rle> | <numeric> 
    [1]  1 [ 1, 2]  * |   3 
    [2]  1 [ 3, 4]  * |   6 
    [3]  1 [ 5, 7]  * |   4 
    [4]  1 [ 8, 9]  * |   1 
    [5]  1 [10, 14]  * |   8 
    [6]  1 [15, 19]  * |   6 
    [7]  1 [20, 23]  * |  11 
    [8]  1 [24, 26]  * |   9 
    [9]  1 [27, 29]  * |   6 
    [10]  1 [30, 39]  * |   2 
    [11]  1 [55, 60]  * |  10 
    ------- 
    seqinfo: 1 sequence from an unspecified genome