2017-08-03 27 views
0

我有一個數據幀data含有基因組內的chromosome和突變的核苷酸的positionlapply上的數據幀的幾個子集

structure(list(chrom = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 
3L, 4L, 4L, 4L, 4L), pos = c(10L, 200L, 134L, 400L, 600L, 1000L, 
20L, 33L, 40L, 45L, 50L, 55L, 100L, 123L)), .Names = c("chrom", 
"pos"), class = "data.frame", row.names = c(NA, -14L)) 

chrom pos 
1  1 10 
2  1 200 
3  1 134 
4  1 400 
5  1 600 
6  1 1000 

而另一tss_locations,包含特徵的位置(tss)在genechromosome之內它現在在:

structure(list(gene = structure(c(1L, 4L, 5L, 6L, 7L, 8L, 9L, 
10L, 11L, 2L, 3L), .Label = c("gene1", "gene10", "gene11", "gene2", 
"gene3", "gene4", "gene5", "gene6", "gene7", "gene8", "gene9" 
), class = "factor"), chrom = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 
3L, 4L, 4L), tss = c(5L, 10L, 23L, 1340L, 313L, 88L, 44L, 57L, 
88L, 74L, 127L)), .Names = c("gene", "chrom", "tss"), class = "data.frame", row.names = c(NA, 
-11L)) 

gene chrom tss 
1 gene1  1 5 
2 gene2  1 10 
3 gene3  1 23 
4 gene4  2 1340 
5 gene5  2 313 
6 gene6  2 88 

我試圖計算相同的染色體到最接近tss的距離爲data每個pos

到目前爲止,我能夠計算從每個data$pos任何tss_locations$tss(即最接近tss到每個pos不論染色體)的距離:

fun <- function(p) { 
    # Get index of nearest tss 
    index<-which.min(abs(tss_locations$tss - p)) 
    # Lookup the value 
    closestTss<-tss_locations$tss[[index]] 
    # Calculate the distance 
    dist<-(closestTss-p) 
    list(snp=p, closest=closestTss, distance2nearest=dist) 
} 

# Run function for each 'pos' in data 
dist2tss<-lapply(data$pos, fun) 

# Convert to data frame and sort descending: 
dist2tss<-do.call(rbind, dist2tss) 
dist2tss<-as.data.frame(dist2tss) 

dist2tss<-arrange(dist2tss,(as.numeric(distance2nearest))) 
dist2tss$distance2nearest<-as.numeric(dist2tss$distance2nearest) 

head(dist2tss) 

    snp closest distance2nearest 
1 600  313    -287 
2 400  313    -87 
3 200  127    -73 
4 100  88    -12 
5 33  23    -10 
6 134  127    -7 

但是,我喜歡能夠在相同的染色體上找到最接近的tss對於每個pos。我知道我可以將它應用到每個染色體上,但我希望看到全局距離(跨所有染色體),但只能在同一染色體上比較postss

我該如何調整以實現此目標?通過染色體對兩個數據幀進行子集併合並結果?

到目前爲止,這是否正確?

回答

1

像這樣的東西可能會工作,以獲得在data數據幀中每個染色體最接近的tss。

data <- structure(list(chrom = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 
           3L, 4L, 4L, 4L, 4L), pos = c(10L, 200L, 134L, 400L, 600L, 1000L, 
                   20L, 33L, 40L, 45L, 50L, 55L, 100L, 123L)), .Names = c("chrom", 
                                "pos"), class = "data.frame", row.names = c(NA, -14L)) 

tss_locations <- structure(list(gene = structure(c(1L, 4L, 5L, 6L, 7L, 8L, 9L, 
                10L, 11L, 2L, 3L), .Label = c("gene1", "gene10", "gene11", "gene2", 
                       "gene3", "gene4", "gene5", "gene6", "gene7", "gene8", "gene9" 
               ), class = "factor"), chrom = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 
                        3L, 4L, 4L), tss = c(5L, 10L, 23L, 1340L, 313L, 88L, 44L, 57L, 
                             88L, 74L, 127L)), .Names = c("gene", "chrom", "tss"), class = "data.frame", row.names = c(NA, 
                                                   -11L)) 

# Generate needed values by applying function to all rows and transposing t() the results 
data[,c("closest_gene", "closest_tss", "min_dist")] <- t(apply(data, 1, function(x){ 
    # Get subset of tss_locations where the chromosome matches the current row 
    genes <- tss_locations[tss_locations$chrom == x["chrom"], ] 

    # Find the minimum distance from the current row's pos to the nearest tss location 
    min.dist <- min(abs(genes$tss - x["pos"])) 

    # Find the closest tss location to the current row's pos 
    closest_tss <- genes[which.min(abs(genes$tss - x["pos"])), "tss"] 

    # Check if closest tss location is less than pos and set min.dist to negative if true 
    min.dist <- ifelse(closest_tss < x["pos"], min.dist * -1, min.dist) 

    # Find the closest gene to the current row's pos 
    closest_gene <- as.character(genes[which.min(abs(genes$tss - x["pos"])), "gene"]) 

    # Return the values to the matrix 
    return(c(closest_gene, closest_tss, min.dist)) 
})) 
+0

謝謝 - 看起來很接近但輸出略有偏差。例如。在運行時,第一行數據看起來像:'1 1 10 gene1 5 0' - 最接近的'tss'是'5',但'min_dist'是'0' – fugu

+0

你說得對,我'已經編輯了答案以糾正這個問題。 –

+0

太棒了!我在問題中提供的數據是玩具數據。我的真實數據是一樣的,除了'數據'有一些額外的列。我怎樣才能調整,以適應(並明確選擇「pos」列)? – fugu