How to look up each position in data frame 2 within the ranges of data frame 1 and pull the matching tag into data frame 2

Hi, I have a complicated problem that I can't figure out how to solve.

I have two data frames, and I want to conditionally match a column from df1 to df2.

Df1: 
gene int start_pos end_pos tag 
A 1 233  422  a1 
A 2 622  766  a2 
A 3 1021  1211  ab 
A 4 1400  1500  b1 
A 4 2000  2200  b2 
B 1 122  233  a1 
B 2 332  665  a2 
C 1 199  433  a1 
C 2 776  899  a2 

df2: 
Gene type pos 
A  shrt 680 
A  long 1420 
B  shrt 350 
C  long 790

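I want to match these two tables based on the 'pos' information.

What I want: for each gene, take the pos (position) in df2 and find where it falls in df1. For example, for the first row of df2 (A, 680), I want to find gene A in df1, search for position 680, and find which 'tag' corresponds to that position.

So in the end I want to add a column to df2, based on the tag information from df1, like this: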

df2: 
Gene type pos tag 
A  shrt 680 a2 
A  long 1420 b1 
B  shrt 350 a2 
C  long 790 a2

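I can't find any way to do this. merge doesn't work because I can't build a unique identifier, and I couldn't find a solution using the match function either.

Note: df1 is basically reference data. Every position in df2 lies between a start and an end in df1. I want to look up the tag information in df1 for each position.

I'm stuck. Any help would be great.

Thanks!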

Source

2016-07-14 AGG

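The problem you are running into with the join is that the column names are not the same (gene vs Gene).

Here is the code you are looking for: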

library(dplyr) 
Df1 <- data.frame(gene = c("A", "A","A","A","A","B","B","C","C") , 
       int = c(1,2,3,4,4,1,2,1,2), 
       start_pos = c(233,622,1021,1400,2000,122,332,199,776), 
       end_pos = c(422,766,1211,1500,2200,233,665,433,899), 
       tag = c("a1", "a2","ab","b1","b2" , "a1","a2","a1","a2")) 


df2 <- data.frame(Gene = c("A","A","B","C"), 
       type = c("shrt", "long", "shrt", "long"), 
       pos = c(680,1420,350,790)) 


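colnames(Df1)[1] <- "Gene"  ## rename 'gene' to 'Gene' so the key columns match 

## pair every df2 row with all Df1 rows of the same gene 
Merge_data <- inner_join(df2, Df1, by = "Gene") 
## keep only the interval that contains pos (boundaries included) 
filter_data <- filter(Merge_data, pos >= start_pos & pos <= end_pos) 

Result <- select(filter_data, c(Gene,type,pos,tag))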

The result is as follows:

Gene type pos tag 
1 A shrt 680 a2 
2 A long 1420 b1 
3 B shrt 350 a2 
4 C long 790 a2
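
For what it's worth, newer versions of dplyr can put the range condition inside the join itself, so the full gene-by-gene set of pairs never has to be built and filtered afterwards. The following is only a sketch, assuming dplyr 1.1.0 or later (where join_by() and its between() helper are available) and the same example data as above, with the reference table keeping its original gene column. The names joined and Result2 are made up for illustration.

```r
library(dplyr)

# Example data from the question (reference table keeps the column name 'gene')
Df1 <- data.frame(
  gene      = c("A", "A", "A", "A", "A", "B", "B", "C", "C"),
  int       = c(1, 2, 3, 4, 4, 1, 2, 1, 2),
  start_pos = c(233, 622, 1021, 1400, 2000, 122, 332, 199, 776),
  end_pos   = c(422, 766, 1211, 1500, 2200, 233, 665, 433, 899),
  tag       = c("a1", "a2", "ab", "b1", "b2", "a1", "a2", "a1", "a2"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Gene = c("A", "A", "B", "C"),
  type = c("shrt", "long", "shrt", "long"),
  pos  = c(680, 1420, 350, 790),
  stringsAsFactors = FALSE
)

# Match on gene AND require start_pos <= pos <= end_pos
# (between() in join_by() uses inclusive bounds by default)
joined <- inner_join(
  df2, Df1,
  by = join_by(Gene == gene, between(pos, start_pos, end_pos))
)

# Keep only the columns asked for in the question
Result2 <- select(joined, Gene, type, pos, tag)
```

With this form there is no need to rename the column in Df1 first, because join_by() names the key on each side explicitly.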

Source

2016-07-14 19:50:07 Kou

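Thanks! This solved my problem. I used the solution with two 'list's instead of 'dataframe's and it worked (lists because I imported two csv files for df1 and df2). – AGG

Instead of renaming the column, you can use by.x & by.y, which tell R which two columns to match on – Kou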
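
As a side note on that comment: by.x and by.y are arguments of base R's merge(), while the dplyr equivalent would be by = c("Gene" = "gene"). A minimal base R sketch of the same idea, assuming the example data from the question with its original column names (merged and hits are illustrative names, and treating the interval boundaries as inclusive is an assumption):

```r
# Example data from the question, original column names kept
Df1 <- data.frame(
  gene      = c("A", "A", "A", "A", "A", "B", "B", "C", "C"),
  int       = c(1, 2, 3, 4, 4, 1, 2, 1, 2),
  start_pos = c(233, 622, 1021, 1400, 2000, 122, 332, 199, 776),
  end_pos   = c(422, 766, 1211, 1500, 2200, 233, 665, 433, 899),
  tag       = c("a1", "a2", "ab", "b1", "b2", "a1", "a2", "a1", "a2"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Gene = c("A", "A", "B", "C"),
  type = c("shrt", "long", "shrt", "long"),
  pos  = c(680, 1420, 350, 790),
  stringsAsFactors = FALSE
)

# by.x names the key column in the first table, by.y the key in the second
merged <- merge(df2, Df1, by.x = "Gene", by.y = "gene")

# Keep rows whose position falls inside the interval (inclusive bounds assumed)
inside <- merged$pos >= merged$start_pos & merged$pos <= merged$end_pos
hits <- merged[inside, c("Gene", "type", "pos", "tag")]
```

Note that merge() sorts its output by the key column by default, so the rows of hits are not guaranteed to come back in the original df2 order.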

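Another, far more obscure way to do this is to build list()s of data.frames, subset by the range of positions within each gene (here driven by the variable df2$pos).

This can be done with a for loop, subsetting the data with which().

First set up the new variable in df2 and create two working lists, named list1 and list2: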

df2$tag <- NA 
list1 <- list() 
list2 <- list()

Now the for loop (df1 here is the reference table from the question, with its original 'gene' column):

for (i in 1:nrow(df2)){ 
    # Use list1 to create a subset matching the genes 
    list1[[i]] <- na.omit(df1[which(df2$Gene[i] == df1$gene),]) 
    # Use list2 to create a subset where df2$pos is greater than or equal to df1$start_pos 
    list2[[i]] <- na.omit(list1[[i]][which(df2$pos[i] >= list1[[i]]$start_pos),]) 
    # Finally assign the 'tag' where df2$pos is less than or equal to df1$end_pos 
    df2$tag[i] <- as.character(list2[[i]][which(df2$pos[i] <= list2[[i]]$end_pos),"tag"]) 
}

And we are left with:

Gene type pos tag 
1 A shrt 680 a2 
2 A long 1420 b1 
3 B shrt 350 a2 
4 C long 790 a2
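
One caveat with the loop above is that the assignment fails when a position matches no interval, since a zero-length value cannot be assigned to df2$tag[i]. A hedged sketch of the same lookup as a small function that returns NA in that case is shown here; lookup_tag is a hypothetical helper name, and taking the first match when intervals overlap is an assumption rather than something the question specifies.

```r
# Example data from the question
df1 <- data.frame(
  gene      = c("A", "A", "A", "A", "A", "B", "B", "C", "C"),
  int       = c(1, 2, 3, 4, 4, 1, 2, 1, 2),
  start_pos = c(233, 622, 1021, 1400, 2000, 122, 332, 199, 776),
  end_pos   = c(422, 766, 1211, 1500, 2200, 233, 665, 433, 899),
  tag       = c("a1", "a2", "ab", "b1", "b2", "a1", "a2", "a1", "a2"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Gene = c("A", "A", "B", "C"),
  type = c("shrt", "long", "shrt", "long"),
  pos  = c(680, 1420, 350, 790),
  stringsAsFactors = FALSE
)

# Hypothetical helper: tag of the interval in 'ref' containing 'pos' for 'gene',
# or NA when no interval matches (first match wins if several overlap)
lookup_tag <- function(gene, pos, ref) {
  hit  <- ref$gene == gene & ref$start_pos <= pos & pos <= ref$end_pos
  tags <- as.character(ref$tag[hit])
  if (length(tags) == 0) NA_character_ else tags[1]
}

# Apply the helper to every (Gene, pos) pair in df2
df2$tag <- mapply(lookup_tag, as.character(df2$Gene), df2$pos,
                  MoreArgs = list(ref = df1), USE.NAMES = FALSE)
```

This keeps the row order of df2 and avoids the intermediate lists, at the cost of scanning df1 once per row of df2.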

Just to give you another option!

Source

2016-07-14 20:31:02
