R - output overlapping intervals

fileA contains intervals (start, end) and the value assigned to each interval (value).
start end value
0 123 1 #value 1 at positions 0 to 122 included.
123 78000 0 #value 0 at positions 123 to 77999 included.
78000 78004 56 #value 56 at positions 78000, 78001, 78002 and 78003.
78004 78005 12 #value 12 at position 78004.
78005 78006 1 #value 1 at position 78005.
78006 78008 21 #value 21 at positions 78006 and 78007.
78008 78056 8 #value 8 at positions 78008 to 78055 included.
78056 81000 0 #value 0 at positions 78056 to 80999 included.
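fileB contains a list of intervals I am interested in. I want to retrieve the overlapping intervals from fileA. The starts and ends do not necessarily match. Here is an example of fileB: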
start end label
77998 78005 romeo
78007 78012 juliet
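The goal is to (1) retrieve the intervals from fileA that overlap with an interval in fileB and (2) append the corresponding label from fileB to each retrieved interval. The expected result is shown below (# marks lines that are discarded; it is only there to help visualization and will not appear in the final output):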
start end value label
#
123 78000 0 romeo
78000 78004 56 romeo
78004 78005 12 romeo
#
78006 78008 21 juliet
78008 78056 8 juliet
#
Here is my attempt at writing the code:
#read from tab-delimited text files which do not contain column names
A<-read.table("fileA.txt",sep="\t",colClasses=c("numeric","numeric","numeric"))
B<-read.table("fileB.txt",sep="\t",colClasses=c("numeric","numeric","character"))
#add column names
colnames(A)<-c("start","end","value")
colnames(B)<-c("start","end","label")
#output intervals in `fileA` that overlap with an interval in `fileB`
A_overlaps<-A[((A$start <= B$start & A$end >= B$start)
|(A$start >= B$start & A$start <= B$end)
|(A$end >= B$start & A$end <= B$end)),]
At this point I am already getting unexpected results:
> A_overlaps
start end value
#missing
3 78000 78004 56
5 78005 78006 1 #this line should not be here
6 78006 78008 21
#missing
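I have not written the part that outputs the labels yet, because I would rather solve this first, but I cannot figure out what I am getting wrong...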
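One likely cause, sketched here on the example data rather than on the real files: `A$start <= B$start` compares the two columns element by element, and because B has 2 rows while A has 8, R recycles B (romeo, juliet, romeo, juliet, ...). Each row of A is then tested against a single row of B instead of against all of them. The data frames below are typed in by hand from the examples above, and `paired_label` is a name introduced only for illustration.

```r
# Example data typed in by hand from the question (assumption: the real
# files have the same layout).
A <- data.frame(
  start = c(0, 123, 78000, 78004, 78005, 78006, 78008, 78056),
  end   = c(123, 78000, 78004, 78005, 78006, 78008, 78056, 81000),
  value = c(1, 0, 56, 12, 1, 21, 8, 0)
)
B <- data.frame(
  start = c(77998, 78007),
  end   = c(78005, 78012),
  label = c("romeo", "juliet"),
  stringsAsFactors = FALSE
)

# Same condition as in the attempt: B's columns (length 2) are recycled
# along A's columns (length 8).
keep <- (A$start <= B$start & A$end >= B$start) |
        (A$start >= B$start & A$start <= B$end) |
        (A$end >= B$start & A$end <= B$end)

# Show which row of B each row of A was actually compared with.
paired_label <- rep_len(B$label, nrow(A))
print(data.frame(A, paired_label, keep))

# Rows kept by the recycled comparison.
print(which(keep))
```

On this data the kept rows are 3, 5 and 6, which matches the unexpected output. Row 5 is kept because the inclusive comparison `A$start <= B$end` treats position 78005 as part of romeo, whereas the comments in fileA suggest that `end` is excluded.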
[Edit] I also tried the following, but it just outputs all of fileA:
A_overlaps <- A[(min(A$start,A$end) < max(B$start,B$end)
& max(A$start,A$end) > min(B$start,B$end)),]
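A possible reason this returns every row, shown as a sketch on the example data: `min()` and `max()` collapse all of their arguments into one number, so the whole condition is a single TRUE or FALSE rather than one value per row of A. A single TRUE used as a row index is recycled and selects all rows. The data frames are typed in by hand from the examples above, and `cond` is an illustrative name.

```r
# Example data typed in by hand from the question.
A <- data.frame(
  start = c(0, 123, 78000, 78004, 78005, 78006, 78008, 78056),
  end   = c(123, 78000, 78004, 78005, 78006, 78008, 78056, 81000),
  value = c(1, 0, 56, 12, 1, 21, 8, 0)
)
B <- data.frame(
  start = c(77998, 78007),
  end   = c(78005, 78012),
  label = c("romeo", "juliet"),
  stringsAsFactors = FALSE
)

# min() and max() each return one number over ALL values passed in,
# so this compares 0 < 78012 and 81000 > 77998.
cond <- min(A$start, A$end) < max(B$start, B$end) &
        max(A$start, A$end) > min(B$start, B$end)

print(length(cond))     # 1
print(cond)             # TRUE
print(nrow(A[cond, ]))  # 8: the single TRUE is recycled over every row
```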
There is an intervals package. – JeremyS
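Without any package, the expected table can also be reproduced in base R by testing every pair of rows. This is only a brute-force sketch under stated assumptions: the example data are typed in by hand, intervals are treated as half-open (start included, end excluded, as the comments in fileA suggest), and `idx`, `hit` and `result` are illustrative names. It builds nrow(A) * nrow(B) pairs, so it may not scale to large files.

```r
# Example data typed in by hand from the question.
A <- data.frame(
  start = c(0, 123, 78000, 78004, 78005, 78006, 78008, 78056),
  end   = c(123, 78000, 78004, 78005, 78006, 78008, 78056, 81000),
  value = c(1, 0, 56, 12, 1, 21, 8, 0)
)
B <- data.frame(
  start = c(77998, 78007),
  end   = c(78005, 78012),
  label = c("romeo", "juliet"),
  stringsAsFactors = FALSE
)

# All (row of A, row of B) index pairs; the A index varies fastest,
# so the result is grouped by interval of B.
idx <- expand.grid(a = seq_len(nrow(A)), b = seq_len(nrow(B)))

# Two half-open intervals [s1, e1) and [s2, e2) overlap
# exactly when s1 < e2 and e1 > s2.
hit <- A$start[idx$a] < B$end[idx$b] & A$end[idx$a] > B$start[idx$b]

# Keep the overlapping pairs and attach the label from B.
result <- data.frame(
  start = A$start[idx$a[hit]],
  end   = A$end[idx$a[hit]],
  value = A$value[idx$a[hit]],
  label = B$label[idx$b[hit]],
  stringsAsFactors = FALSE
)
print(result)
```

With strict comparisons, the interval 78005 to 78006 no longer matches romeo, because position 78005 is outside an interval that ends at 78005.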