2017-09-13 53 views
-1

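How to use the unique function with blank/missing values

How can I make the rows of the data frame below unique based on the second column when it contains blank/missing values?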

> head(interproscan) 
       V1  V14 
1 sp0000001-mRNA-1   
2 sp0000001-mRNA-1   
3 sp0000001-mRNA-1   
4 sp0000005-mRNA-1 GO:0003723 
5 sp0000006-mRNA-1 GO:0016021 
6 sp0000006-mRNA-1 GO:0016021 


> head(unique(interproscan[ , 1:2])) 
       V1        V14 
1 sp0000001-mRNA-1         
4 sp0000005-mRNA-1      GO:0003723 
5 sp0000006-mRNA-1      GO:0016021 
7 sp0000006-mRNA-2      GO:0016021 
9 sp0000006-mRNA-3      GO:0016021 

The goal would be:

                V1        V14
1 sp0000001-mRNA-1
4 sp0000005-mRNA-1 GO:0003723
5 sp0000006-mRNA-1 GO:0016021

Thank you in advance.

+0

'library(tidyverse); interproscan %>% distinct(V14, .keep_all = T)' works for your goal. Is there anything else you need? – Tunn

+0

'library(tidyverse); > interproscan %>% distinct(V14, .keep_all = T) V1 V14 1: sp0000001-mRNA-1 NA > head(interproscan) V1 V14 1: sp0000001-mRNA-1 NA 2: sp0000001-mRNA-1 NA 3: sp0000001-mRNA-1 NA 4: sp0000005-mRNA-1 NA 5: sp0000006-mRNA-1 NA 6: sp0000006-mRNA-1 NA' – user977828

Answers

0

Try this on a data frame or data table:

interproscan <- data.frame(interproscan) 

unique(interproscan) 

Output:

   V1  V14 
1 sp0000001-mRNA-1   
4 sp0000005-mRNA-1 GO:0003723 
5 sp0000006-mRNA-1 GO:0016021 

Sample data:

require(data.table) 
interproscan <- fread("V1,    V14 
         sp0000001-mRNA-1,   
         sp0000001-mRNA-1,   
         sp0000001-mRNA-1,    
         sp0000005-mRNA-1, GO:0003723 
         sp0000006-mRNA-1, GO:0016021 
         sp0000006-mRNA-1, GO:0016021") 
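
For what it's worth, here is a minimal base R sketch of why unique() copes with the blanks: it compares whole rows, and an empty string (or NA) is just another value to compare. The toy data frame is rebuilt by hand here instead of with fread, and res_blank, with_na and res_na are names I made up for illustration.

```r
# Toy copy of the question's data, built in base R (no packages needed).
interproscan <- data.frame(
  V1  = c("sp0000001-mRNA-1", "sp0000001-mRNA-1", "sp0000001-mRNA-1",
          "sp0000005-mRNA-1", "sp0000006-mRNA-1", "sp0000006-mRNA-1"),
  V14 = c("", "", "", "GO:0003723", "GO:0016021", "GO:0016021"),
  stringsAsFactors = FALSE
)

# unique() compares whole rows; "" is treated like any other value,
# so the three blank rows for sp0000001-mRNA-1 collapse into one.
res_blank <- unique(interproscan)

# Assumption for illustration: blanks (or whitespace-only cells) are
# recoded as NA first. Rows whose V14 is NA still collapse the same way.
with_na <- interproscan
with_na$V14[trimws(with_na$V14) == ""] <- NA
res_na <- unique(with_na)

print(res_blank)
print(res_na)
```

Both calls should keep original rows 1, 4 and 5. If your real file has stray spaces in V14, trimming them before calling unique() is probably the first thing to check.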
1

You need to modify V1 so that it groups the way you intend. I used gsub to drop the trailing -number suffix.

library(dplyr) 
ans <- df %>% 
     group_by(grp = sub("-\\d+$", "", V1), V14) %>% # group on V1 without its trailing -number suffix 
     arrange(V1) %>% # unnecessary for your toy example but just in case for your full data 
     slice(1) %>%  # keep the first row of each group 
     ungroup() %>% 
     select(-grp)  # discard intermediate grouping variable 

Output

# A tibble: 3 x 3 
    id    V1  V14 
    <int>   <chr>  <chr> 
1  1 sp0000001-mRNA-1   
2  4 sp0000005-mRNA-1 GO:0003723 
3  5 sp0000006-mRNA-1 GO:0016021 

Data

df <- structure(list(id = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L), V1 = c("sp0000001-mRNA-1", 
"sp0000001-mRNA-1", "sp0000001-mRNA-1", "sp0000005-mRNA-1", "sp0000006-mRNA-1", 
"sp0000006-mRNA-1", "sp0000006-mRNA-2", "sp0000006-mRNA-3"), 
    V14 = c("", "", "", "GO:0003723", "GO:0016021", "GO:0016021", 
    "GO:0016021", "GO:0016021")), class = "data.frame", .Names = c("id", 
"V1", "V14"), row.names = c(NA, -8L)) 


    id    V1  V14 
1 1 sp0000001-mRNA-1   
2 2 sp0000001-mRNA-1   
3 3 sp0000001-mRNA-1   
4 4 sp0000005-mRNA-1 GO:0003723 
5 5 sp0000006-mRNA-1 GO:0016021 
6 6 sp0000006-mRNA-1 GO:0016021 
7 7 sp0000006-mRNA-2 GO:0016021 
8 9 sp0000006-mRNA-3 GO:0016021
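
If you would rather avoid dplyr, the same idea can be sketched in base R with duplicated(). This is only a rough equivalent under the same assumption as above, namely that rows should be grouped on V1 without its trailing -number suffix together with V14; key, ord and keep are names I introduced for illustration.

```r
# Same toy data as above, rebuilt so the snippet runs on its own.
df <- data.frame(
  id  = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L),
  V1  = c("sp0000001-mRNA-1", "sp0000001-mRNA-1", "sp0000001-mRNA-1",
          "sp0000005-mRNA-1", "sp0000006-mRNA-1", "sp0000006-mRNA-1",
          "sp0000006-mRNA-2", "sp0000006-mRNA-3"),
  V14 = c("", "", "", "GO:0003723", "GO:0016021", "GO:0016021",
          "GO:0016021", "GO:0016021"),
  stringsAsFactors = FALSE
)

# Sort by V1 (ties broken by id) so the first row of each group is predictable.
ord <- df[order(df$V1, df$id), ]

# Grouping key: V1 without its trailing "-<number>" suffix.
key <- sub("-[0-9]+$", "", ord$V1)

# Keep the first row for every (key, V14) combination; blanks in V14
# are compared like any other value.
keep <- !duplicated(data.frame(key = key, V14 = ord$V14))
ans <- ord[keep, ]

print(ans)
```

On the toy data this keeps the rows with id 1, 4 and 5, which matches the dplyr output above.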