2015-12-24
0

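How to select a range of rows in R

I have a data frame mydf. I also have a vector called myvec <- c("chr5:11", "chr3:112", "chr22:334"). If any element of the vector matches a key in mydf, I want to select a range of rows around the match (the 3 values above it and the 3 values below it, plus the match itself) and produce a subset of mydf (result).

Since chr5:11 in myvec matches a key in mydf, we select the rows from chr5:8 (three values below) to chr5:14 (three values above) in result.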

mydf<- structure(list(key = structure(c(5L, 2L, 7L, 8L, 4L, 1L, 6L, 
3L, 11L, 10L, 9L), .Names = c("34", "35", "36", "37", "38", "39", 
"40", "41", "42", "43", "44"), .Label = c("chr5:10", "chr5:11", 
"chr5:1123", "chr5:118", "chr5:12", "chr5:123", "chr5:13", "chr5:14", 
"chr5:19", "chr5:8", "chr5:9"), class = "factor"), variantId = structure(1:11, .Names = c("34", 
"35", "36", "37", "38", "39", "40", "41", "42", "43", "44"), .Label = c("9920068", 
"9920069", "9920070", "9920071", "9920072", "9920073", "9920074", 
"9920075", "9920076", "9920077", "9920078"), class = "factor")), .Names = c("key", 
"variantId"), row.names = c("34", "35", "36", "37", "38", "39", 
"40", "41", "42", "43", "44"), class = "data.frame") 

result

 key   variant 
43 "chr5:8" "9920077" 
42 "chr5:9" "9920076" 
39 "chr5:10" "9920073" 
35 "chr5:11" "9920069" 
34 "chr5:12" "9920068" 
36 "chr5:13" "9920070" 
37 "chr5:14" "9920071" 
+1

According to your dput, 'mydf' is a matrix, not a data.frame. Please fix. –

+0

@Pascal Thanks, I have fixed it. – MAPK

Answers

2

You can use the GenomicRanges package.

library(GenomicRanges) 

myvec <- c("chr5:11", "chr3:112", "chr22:334") 
myvec.gr <- GRanges(gsub(":.+", "", myvec), 
        IRanges(as.numeric(gsub(".+:", "", myvec))-3, 
          as.numeric(gsub(".+:", "", myvec))+3)) 

mydf.gr <- GRanges(gsub(":.+", "", mydf[,"key"]), 
        IRanges(as.numeric(gsub(".+:", "", mydf[,"key"])), 
          as.numeric(gsub(".+:", "", mydf[,"key"])))) 

d.v.op <- findOverlaps(mydf.gr, myvec.gr) 

mydf[queryHits(d.v.op), ] 
# key  variantId 
# 34 "chr5:12" "9920068" 
# 35 "chr5:11" "9920069" 
# 36 "chr5:13" "9920070" 
# 37 "chr5:14" "9920071" 
# 39 "chr5:10" "9920073" 
# 42 "chr5:9" "9920076" 
# 43 "chr5:8" "9920077" 
+0

Thanks a lot, I think GRanges has many uses. – MAPK

3

How about the following (I use data.table; the base version is almost identical):

library(data.table) 
mydf <- as.data.table(mydf) #(if mydf really is stored as a matrix currently) 

myvec2 <- lapply(strsplit(gsub("chr", "", myvec), split=":"), as.integer) 

mydf[unique(Reduce(c, sapply(myvec2, function(x){ 
    which(key %in% paste0("chr", x[1], ":", seq((x2 <- x[2]) - 3L, x2 + 3L)))} 
))), ] 

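(In base, replace as.data.table with as.data.frame and key with mydf$key, and replace the closing square bracket ] with ,].)

Extra option for sorting

Actually, I think this option is better overall, because it stores your information in a more flexible way to begin with. This version is a bit heavier on data.table syntax.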
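For reference, here is a minimal sketch of what that base version could look like. The data frame below is rebuilt from the question's dput with plain character columns and default row names (an assumption, to keep the example self-contained), and unlist(lapply(...)) stands in for Reduce(c, sapply(...)); the names hits and result are only illustrative.

```r
# Rebuilt from the question's dput; character columns and default row names are assumptions
mydf <- data.frame(
  key = c("chr5:12", "chr5:11", "chr5:13", "chr5:14", "chr5:118", "chr5:10",
          "chr5:123", "chr5:1123", "chr5:9", "chr5:8", "chr5:19"),
  variantId = as.character(9920068:9920078),
  stringsAsFactors = FALSE
)
myvec <- c("chr5:11", "chr3:112", "chr22:334")

# Split each "chrN:pos" element into an integer pair c(N, pos)
myvec2 <- lapply(strsplit(gsub("chr", "", myvec), split = ":"), as.integer)

# For each element, find the rows whose key lies within pos - 3 .. pos + 3
hits <- unique(unlist(lapply(myvec2, function(x) {
  which(mydf$key %in% paste0("chr", x[1], ":", seq(x[2] - 3L, x[2] + 3L)))
})))

result <- mydf[hits, ]
print(result)
```

The rows come back in their original order in mydf, not sorted by position.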

mydf <- as.data.table(mydf) 

#Split your `key` variable into its pre- and post-colon components 
# (of course using better names if those numbers mean something 
# more specific to you) 
mydf[ , c("chr", "sub") := 
     .(as.integer(gsub("chr|:.*", "", key)), 
      as.integer(gsub(".*:", "", key)))] 

Now proceed as before, with slight adjustments:

myvec2<-lapply(strsplit(gsub("chr","",myvec),split=":"),as.integer) 

mydf[unique(Reduce(c, sapply(myvec2, function(x){ 
    which(chr == x[1] & sub %in% seq((x2 <- x[2]) - 3L, x2 + 3L))} 
)))][order(chr, sub)] 

Output:

 key variantId chr sub 
1: chr5:8 9920077 5 8 
2: chr5:9 9920076 5 9 
3: chr5:10 9920073 5 10 
4: chr5:11 9920069 5 11 
5: chr5:12 9920068 5 12 
6: chr5:13 9920070 5 13 
7: chr5:14 9920071 5 14 
+0

@Pascal Fixed. OP: it's hard (or rather, roundabout) to get things to print in the order you want. Is that crucial? – MichaelChirico

+0

Thanks, the order is just meant to stay ascending. I think I can use the sort option now. – MAPK

+0

@MAPK The problem is that, because it's stored as a string, 'sort' won't actually work correctly: '"chr5:1123"' comes right after '"chr5:11"'. – MichaelChirico
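A small sketch of that sorting issue, using a handful of keys taken from the question. The variable names (keys, chr, pos, sorted_num) are only illustrative, and method = "radix" is used so the character sort does not depend on the locale.

```r
keys <- c("chr5:12", "chr5:11", "chr5:1123", "chr5:118", "chr5:8")

# Character sort compares the strings character by character
sorted_chr <- sort(keys, method = "radix")
print(sorted_chr)
# "chr5:11" "chr5:1123" "chr5:118" "chr5:12" "chr5:8"

# Extract the numeric parts and order by those instead
chr <- as.integer(gsub("chr|:.*", "", keys))
pos <- as.integer(gsub(".*:", "", keys))
sorted_num <- keys[order(chr, pos)]
print(sorted_num)
# "chr5:8" "chr5:11" "chr5:12" "chr5:118" "chr5:1123"
```

This is the same idea as the chr and sub columns in the answer above: once the two parts are stored as integers, ordering by them gives ascending positions.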