2014-01-16 64 views

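Combine data frame rows based on values within ranges

I am trying to combine data from two data frames based on whether a value contained in one falls within a range in the second (it is not truly a merge or a join).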

For convenience, the data are at the end of the post. One data frame (df1) looks like this:

 Chromosome Position  P.value start.range end.range name
          2  4553493 8.23e-05     4453493   4653493    A
          3 24548810 1.04e-04    24448810  24648810    B
          1  9952003 2.09e-04     9852003  10052003    C

The second data frame (df2) is much longer, but its head looks like this:

 ensembl_gene_id chromosome_name start_position end_position
    OS01G0281600               1       10048273     10050309
    OS01G0281400               1       10021423     10027120
    OS01G0281301               1       10019633     10020376
    OS01G0281200               1       10011875     10015468
    OS01G0281100               1       10008075     10011595
    OS01G0281000               1       10003952     10007742

I need to match rows from each data frame if df1$Position is within 100000 of df2$start_position or df2$end_position, i.e. (abs(df1$Position - df2$start_position) < 100000 | abs(df1$Position - df2$end_position) < 100000).
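"Within 100000" is a distance in either direction, so it may help to check the test on a single position first. The following is a minimal sketch, using row A of df1 and three df2 rows copied from the data in this post; the names `pos`, `genes`, `window`, `near` and `near_no_abs` are illustrative only and not part of the original data.

```r
# Position of df1 row "A" (chromosome 2), copied from the post's data.
pos <- 4553493L

# Three df2 rows copied from the post's data: two on chromosome 2, one on 1.
genes <- data.frame(
  ensembl_gene_id = c("OS02G0181400", "OS02G0183000", "OS01G0281600"),
  chromosome_name = c(2L, 2L, 1L),
  start_position  = c(4551426L, 4639932L, 10048273L),
  end_position    = c(4552860L, 4641686L, 10050309L)
)

window <- 100000L  # illustrative name for the distance cutoff

# TRUE when either gene boundary lies within `window` of the position.
near <- abs(pos - genes$start_position) < window |
        abs(pos - genes$end_position)   < window

# Same test without abs(): a negative difference is always below the
# cutoff, so a gene far beyond the position also passes.
near_no_abs <- (pos - genes$start_position) < window |
               (pos - genes$end_position)   < window

genes[near, ]
```

In this sketch the chromosome 1 gene passes the signed test but not the absolute one, which is one way unexpected matches can appear.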

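What I need as output is a list or a data frame of the matching rows. There will be multiple df2 rows matching each df1 row, and multiple entries per chromosome, although df1$name is unique. I have been trying ddply and various applications of custom functions, but I am quickly getting stuck. Any ideas?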

Data:

df1 <- structure(list(Chromosome = c(2L, 3L, 1L), Position = c(4553493L, 
24548810L, 9952003L), P.value = c(8.23e-05, 0.000104, 0.000209 
), start.range = c(4453493, 24448810, 9852003), end.range = c(4653493, 
24648810, 10052003), name = c("A", "B", "C")), .Names = c("Chromosome", 
"Position", "P.value", "start.range", "end.range", "name"), class = "data.frame", row.names = c(NA, 
3L)) 

df2 <- structure(list(ensembl_gene_id = c("OS01G0281600", "OS01G0281400", 
"OS01G0281301", "OS01G0281200", "OS01G0281100", "OS01G0281000", 
"OS01G0280500", "OS01G0280400", "OS01G0280000", "OS01G0279900", 
"OS01G0279800", "OS01G0279700", "OS01G0279400", "OS01G0279300", 
"OS01G0279200", "OS01G0279100", "OS01G0279000", "OS01G0278900", 
"OS01G0278950", "OS02G0183000", "OS02G0182850", "OS02G0182900", 
"OS02G0182700", "OS02G0182800", "OS02G0182500", "OS02G0182300", 
"OS02G0181900", "OS02G0182100", "OS02G0181800", "OS02G0181400", 
"OS02G0180900", "OS02G0180700", "OS02G0180500", "OS02G0180200", 
"OS02G0180400", "OS02G0180100", "OS03G0640300", "OS03G0640400", 
"OS03G0640000", "OS03G0640100", "OS03G0639700", "OS03G0639800", 
"OS03G0639600", "OS03G0639400", "OS03G0639300", "OS03G0638900", 
"OS03G0639100", "OS03G0638400", "OS03G0638800", "OS03G0638300", 
"OS03G0638200"), chromosome_name = c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), start_position = c(10048273L, 
10021423L, 10019633L, 10011875L, 10008075L, 10003952L, 9967185L, 
9962807L, 9936850L, 9928971L, 9917593L, 9913390L, 9889550L, 9887657L, 
9878384L, 9874379L, 9866730L, 9859354L, 9863216L, 4639932L, 4629617L, 
4630446L, 4616832L, 4625425L, 4598883L, 4594375L, 4567630L, 4573831L, 
4563073L, 4551426L, 4521670L, 4497115L, 4486531L, 4460342L, 4481872L, 
4455016L, 24630180L, 24638186L, 24616417L, 24621460L, 24591421L, 
24596843L, 24574540L, 24564913L, 24544511L, 24487877L, 24514494L, 
24466606L, 24476060L, 24454477L, 24449135L), end_position = c(10050309L, 
10027120L, 10020376L, 10015468L, 10011595L, 10007742L, 9969073L, 
9966715L, 9947933L, 9935981L, 9921565L, 9917318L, 9902737L, 9889123L, 
9885517L, 9876678L, 9870864L, 9860677L, 9866617L, 4641686L, 4630180L, 
4634616L, 4621974L, 4628750L, 4601382L, 4595386L, 4573049L, 4578257L, 
4566597L, 4552860L, 4523668L, 4500124L, 4489409L, 4463571L, 4483470L, 
4457715L, 24634746L, 24641449L, 24617859L, 24629502L, 24596437L, 
24600376L, 24579212L, 24565726L, 24549550L, 24489307L, 24515219L, 
24473558L, 24480927L, 24457481L, 24453890L)), .Names = c("ensembl_gene_id", 
"chromosome_name", "start_position", "end_position"), class = "data.frame", row.names = c(NA, 
-51L)) 

Answers

1

Is this what you want?

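library(plyr)  # provides ddply() and .()

# For each df1 row (grouped by its unique name), keep the df2 rows whose
# start or end lies within 100000 of that row's Position, in either direction.
ddply(df1, .(name), function(x) {
  df2[abs(x$Position - df2$start_position) < 100000 |
      abs(x$Position - df2$end_position) < 100000, ]
})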

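This was really helpful; when I ran it on a larger set it got me almost all the way there. At that point I was getting some cross-matches (between chromosomes), so I just added one line to finish: ddply(df1, .(name), function(x) { df2[(x$Position - df2$start_position) < 100000 | (x$Position - df2$end_position) < 100000, ] }) – MHtaylor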
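
The extra line mentioned in the comment does not appear in the quoted code. One way a chromosome restriction could be written is sketched below in base R with merge(). This is an assumption about the intent, not the commenter's actual change. The data are a small subset of the post's df1 and df2 so the sketch runs on its own, and `window`, `pairs`, `keep` and `matches` are illustrative names.

```r
# Small subset of the post's data (rows A and C of df1, four df2 genes).
df1 <- data.frame(
  Chromosome = c(2L, 1L),
  Position   = c(4553493L, 9952003L),
  name       = c("A", "C")
)
df2 <- data.frame(
  ensembl_gene_id = c("OS01G0281600", "OS01G0278900",
                      "OS02G0183000", "OS02G0181400"),
  chromosome_name = c(1L, 1L, 2L, 2L),
  start_position  = c(10048273L, 9859354L, 4639932L, 4551426L),
  end_position    = c(10050309L, 9860677L, 4641686L, 4552860L)
)

window <- 100000L  # illustrative name for the distance cutoff

# Pair each df1 row only with df2 rows on the same chromosome.
pairs <- merge(df1, df2, by.x = "Chromosome", by.y = "chromosome_name")

# Keep pairs where either gene boundary is within `window` of the position.
keep <- abs(pairs$Position - pairs$start_position) < window |
        abs(pairs$Position - pairs$end_position)   < window
matches <- pairs[keep, ]

matches
```

Because merge() builds every same-chromosome pair before filtering, this approach may use a lot of memory on large gene tables; split(matches, matches$name) would give a list with one element per df1 name if a list is preferred over a data frame.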