2015-11-03 43 views
0

我想崩潰以下數據框數據框崩潰根據行數,計算加權平均值[R

df 

Chromosome Start End lengthMB imba log2 Cn mCn Cn_ 
chr1 0 8022945 8.023 0.026905119 -0.001671481 2 1 1.99 
chr1 8022945 9168284 1.145 0.030441784 0.000601976 2 1 2 
chr1 9168284 9598904 0.431 NA -0.024952441 2 1 1.91 
chr1 9598904 31392788 21.794 0.036011994 0.002151497 3 1 3.01 
chr2 0 8022930 8.023 0.026905119 -0.001671481 3 1 2.89 
chr2 8022930 9168284 1.145 0.030441784 0.000601976 2 1 1.87 
chr2 9168284 9598904 0.431 NA -0.024952441 2 1 1.57 
chr2 9598904 31392788 21.794 0.036011994 0.002151497 2 0 1.87 
chr2 31392788 35402000 1.164 0.029733771 0.003149921 2 1 2.01 
chr3 0 8040000 1.479 NA 0.000969256 2 1 2 
chr3 8040000 9168284 8.185 0.033499045 -0.031338811 1 0 0.89 
chr3 9168284 9598904 3.952 0.036792754 0.002847936 1 0 0.78 
chr3 9598904 31392788 0.883 0.049003807 -0.021413391 2 1 1.92 
chr3 31392788 35402000 4.095 0.037653564 0.011944688 2 1 2.04 
chr4 0 8022930 11.065 0.035092332 -0.022844471 2 1 1.91 
chr4 8022930 9168284 40.635 0.037690844 0.006703603 2 1 2.02 
chr4 9168284 9598904 0.545 0.047435696 -0.021068024 2 1 1.92 

通過匹配只有具有相同的道道和MC值我要崩潰的行連續行。例如,對於第4行,我們有以下幾點:

Chromosome Start End lengthMB imba log2 Cn mCn Cn_ 
chr1 0 8022945 8.023 0.026905119 -0.001671481 2 1 1.99 
chr1 8022945 9168284 1.145 0.030441784 0.000601976 2 1 2 
chr1 9168284 9598904 0.431 NA -0.024952441 2 1 1.91 
chr1 9598904 31392788 21.794 0.036011994 0.002151497 3 1 3.01 

我要崩潰了,他們有相同的道道和MC得分連續行,所以第一個三排,每有一個「2」和分別在Cn和mCn列上顯示「1」,還要更改End列以反映該崩潰。

Chromosome Start End lengthMB imba log2 Cn mCn Cn_ 
chr1 0 9598904 8.023 0.026905119 -0.001671481 2 1 1.99 

但我也想改變Cn_column,使其在lengthMB得分是什麼該行的加權平均Cn_dependant。所以對於前三行的計算爲:

((8.023/9.599) * 1.99) + ((1.145/9.599) * 2) + ((0.431/9.599) * 1.91) = 1.987 

輸出的前四個獨特的染色體:

Chromosome Start End lengthMB imba log2 Cn mCn Cn_ 
chr1 0 9598904 8.023 0.026905119 -0.001671481 2 1 1.99 
chr1 9598904 31392788 21.794 0.036011994 0.002151497 3 1 3.01 
chr2 0 8022930 8.023 0.026905119 -0.001671481 3 1 2.89 
chr2 8022930 9598904 1.145 0.030441784 0.000601976 2 1 1.79 
chr2 9598904 31392788 21.794 0.036011994 0.002151497 2 0 1.87 
chr2 31392788 35402000 1.164 0.029733771 0.003149921 2 1 2.01 
chr3 0 8040000 1.479 NA 0.000969256 2 1 2 
chr3 8040000 9598904 8.185 0.033499045 -0.031338811 1 0 0.836 
chr3 9598904 35402000 0.883 0.049003807 -0.021413391 2 1 2.02 
chr4 0 9598904 11.065 0.035092332 -0.022844471 2 1 2 

嘗試過這樣的事情,但我也不知道如何將計算...

squish_segments <- function(sample) { 
    setDT(sample)[, .ind:= cumsum(c(TRUE,Start[-1]!=End[-.N])), 
    list(lengthMB, probes, snps, imba, log2, Cn, mCn, Cn_)][, 
    list(Chr=Chromosome[1], Start=Start[1], End=End[.N]), 
    list(lengthMB, probes, snps, imba, log2, Cn, mCn, Cn_, .ind)][,.ind:=NULL][] 
} 
+0

如何'((8.023/9.599)* 1.99)+((1.145/9.599)* 2)+((0.431/9.599)* 1.91)= 1.9435'?我發現'1.987601'。另外,當您摺疊這些行時,您希望爲包含不同信息的列保留哪些值?例如。 'Start','End','imba','log2'。 – AntoniosK

+0

我想保留所有列 – user3324491

+0

我知道你想保留它們,但是你沒有指定你想要的值。例如,'Cn = 2'和'mCn = 1'的'chr1'最初有3行,所以在這些列中有3個不同的值。此外,在你想要的輸出中,你有一個染色體多次具有相同的「Cn」和「mCn」。檢查'chr2'的第4和6行以及'chr3'的第7和第9行。看起來你不會因爲某些原因而崩潰。 – AntoniosK

回答

1

首先,請提供您的數據集的dput輸出你的問題更具有可重複性。

我認爲這是你想要的低級別。

setkey(df, Chromosome, Cn, mCn, Start) 

df[, list(
    Start=min(Start), 
    End=max(End), 
    lengthMB=lengthMB[1], 
    imba=imba[1], 
    log2=log2[1], 
    Cn_=weighted.mean(Cn_, lengthMB) 
), keyby=list(Chromosome, Cn , mCn)] 
0

可以識別唯一的「事件」(具有相同Cn和mCn分數的連續行),然後簡單地遍歷這些事件並相應地修改行。效率不高,但應該完成這項工作。

txt <- "Chromosome Start End lengthMB imba log2 Cn mCn Cn_ 
chr1 8022945 9168284 1.145 0.030441784 0.000601976 2 1 2 
chr1 9168284 9598904 0.431 NA -0.024952441 2 1 1.91 
chr1 9598904 31392788 21.794 0.036011994 0.002151497 3 1 3.01 
chr2 0 8022930 8.023 0.026905119 -0.001671481 3 1 2.89 
chr2 8022930 9168284 1.145 0.030441784 0.000601976 2 1 1.87 
chr2 9168284 9598904 0.431 NA -0.024952441 2 1 1.57 
chr2 9598904 31392788 21.794 0.036011994 0.002151497 2 0 1.87 
chr2 31392788 35402000 1.164 0.029733771 0.003149921 2 1 2.01 
chr3 0 8040000 1.479 NA 0.000969256 2 1 2 
chr3 8040000 9168284 8.185 0.033499045 -0.031338811 1 0 0.89 
chr3 9168284 9598904 3.952 0.036792754 0.002847936 1 0 0.78 
chr3 9598904 31392788 0.883 0.049003807 -0.021413391 2 1 1.92 
chr3 31392788 35402000 4.095 0.037653564 0.011944688 2 1 2.04 
chr4 0 8022930 11.065 0.035092332 -0.022844471 2 1 1.91 
chr4 8022930 9168284 40.635 0.037690844 0.006703603 2 1 2.02 
chr4 9168284 9598904 0.545 0.047435696 -0.021068024 2 1 1.92" 

df <- read.table(text=txt, header=T) 

#identify each unique event 
df$eventid <- with(df, cumsum(c(1,diff(as.numeric(factor(Chromosome)))!=0 | diff(Cn)!=0 | diff(mCn)!=0))) 

#loop through events 
for(i in 1:max(df$eventid)){ 
    #identify rows in df with ith event 
    rows.i <- which(df$eventid == i) 

    df[rows.i,] <- within(df[rows.i,],{ 
     #calculate values of interest and assign to first row of event 
     Start[1] <- min(Start) 
     End[1] <- max(End) 
     Cn_[1] <- sum((lengthMB/sum(lengthMB))*Cn_) 
     lengthMB[1] <- sum(lengthMB)  
    }) 

    #drop all but first row 
    if(length(rows.i) > 1) df <- df[-rows.i[-1],] 

} #end i 

結果

> df 
    Chromosome Start  End lengthMB  imba   log2 Cn mCn  Cn_ eventid 
1  chr1 8022945 9598904 1.576 0.03044178 0.000601976 2 1 1.9753871  1 
3  chr1 9598904 31392788 21.794 0.03601199 0.002151497 3 1 3.0100000  2 
4  chr2  0 8022930 8.023 0.02690512 -0.001671481 3 1 2.8900000  3 
5  chr2 8022930 9598904 1.576 0.03044178 0.000601976 2 1 1.7879569  4 
7  chr2 9598904 31392788 21.794 0.03601199 0.002151497 2 0 1.8700000  5 
8  chr2 31392788 35402000 1.164 0.02973377 0.003149921 2 1 2.0100000  6 
9  chr3  0 8040000 1.479   NA 0.000969256 2 1 2.0000000  7 
10  chr3 8040000 9598904 12.137 0.03349904 -0.031338811 1 0 0.8541823  8 
12  chr3 9598904 35402000 4.978 0.04900381 -0.021413391 2 1 2.0187143  9 
14  chr4  0 9598904 52.245 0.03509233 -0.022844471 2 1 1.9956599  10 
1

這是一個dplyr方法。

library(dplyr) 

df = read.table(text= 
        "Chromosome Start End lengthMB imba log2 Cn mCn Cn_ 
       chr1 0 8022945 8.023 0.026905119 -0.001671481 2 1 1.99 
       chr1 8022945 9168284 1.145 0.030441784 0.000601976 2 1 2 
       chr1 9168284 9598904 0.431 NA -0.024952441 2 1 1.91 
       chr1 9598904 31392788 21.794 0.036011994 0.002151497 3 1 3.01 
       chr2 0 8022930 8.023 0.026905119 -0.001671481 3 1 2.89 
       chr2 8022930 9168284 1.145 0.030441784 0.000601976 2 1 1.87 
       chr2 9168284 9598904 0.431 NA -0.024952441 2 1 1.57 
       chr2 9598904 31392788 21.794 0.036011994 0.002151497 2 0 1.87 
       chr2 31392788 35402000 1.164 0.029733771 0.003149921 2 1 2.01 
       chr3 0 8040000 1.479 NA 0.000969256 2 1 2 
       chr3 8040000 9168284 8.185 0.033499045 -0.031338811 1 0 0.89 
       chr3 9168284 9598904 3.952 0.036792754 0.002847936 1 0 0.78 
       chr3 9598904 31392788 0.883 0.049003807 -0.021413391 2 1 1.92 
       chr3 31392788 35402000 4.095 0.037653564 0.011944688 2 1 2.04 
       chr4 0 8022930 11.065 0.035092332 -0.022844471 2 1 1.91 
       chr4 8022930 9168284 40.635 0.037690844 0.006703603 2 1 2.02 
       chr4 9168284 9598904 0.545 0.047435696 -0.021068024 2 1 1.92", header=T) 


df %>% 
mutate(Consec = ifelse(Chromosome == dplyr::lag(Chromosome, default = Chromosome[1]) & ## flag consecutive matching chromosomes 
         Cn == dplyr::lag(Cn, default = Cn[1]) & 
         mCn == dplyr::lag(mCn, default = mCn[1]), 0, 1), 
     Consec = cumsum(Consec)) %>%  ## create an id for consecutive matching chromosomes 
group_by(Chromosome, Cn, mCn, Consec) %>% 
summarize(Cn_ = sum(lengthMB * Cn_)/sum(lengthMB), 
      Start = min(Start), 
      End = max(End), 
      lengthMB = first(lengthMB), 
      imba= first(imba), 
      log2= first(log2)) %>% 
ungroup() %>% ## only if you want to ungroup 
select(Chromosome,Start,End, lengthMB,imba,log2,Cn,mCn,Cn_) %>% ## to re arrange column order 
arrange(Chromosome, Start) 


# Chromosome Start  End lengthMB  imba   log2 Cn mCn  Cn_ 
#  (fctr) (int) (int) (dbl)  (dbl)  (dbl) (int) (int)  (dbl) 
# 1  chr1  0 9598904 8.023 0.02690512 -0.001671481  2  1 1.9876008 
# 2  chr1 9598904 31392788 21.794 0.03601199 0.002151497  3  1 3.0100000 
# 3  chr2  0 8022930 8.023 0.02690512 -0.001671481  3  1 2.8900000 
# 4  chr2 8022930 9598904 1.145 0.03044178 0.000601976  2  1 1.7879569 
# 5  chr2 9598904 31392788 21.794 0.03601199 0.002151497  2  0 1.8700000 
# 6  chr2 31392788 35402000 1.164 0.02973377 0.003149921  2  1 2.0100000 
# 7  chr3  0 8040000 1.479   NA 0.000969256  2  1 2.0000000 
# 8  chr3 8040000 9598904 8.185 0.03349904 -0.031338811  1  0 0.8541823 
# 9  chr3 9598904 35402000 0.883 0.04900381 -0.021413391  2  1 2.0187143 
# 10  chr4  0 9598904 11.065 0.03509233 -0.022844471  2  1 1.9956599 

注意lagdplyr功能,也是一個stats包功能。我必須寫dplyr::lag否則當我嘗試在lag內指定default =時會出現衝突。我不知道你或其他人是否可以複製此問題。

+0

感謝編輯@bramtayl。 – AntoniosK

+0

嘿,我已經包括了接下來的兩個獨特的染色體值的輸出,因爲這不會給我很多我想要的輸出!謝謝 – user3324491

+0

我會根據您提供的新數據再次檢查我的過程。但是,無論我多努力,我都不會讓'((8.023/9.599)* 1.99)+((1.145/9.599)* 2)+((0.431/9.599)* 1.91)'等於1.9435 '。 – AntoniosK

0

如果我正確理解你的問題,你可以用一行data.table快速分組來完成。

library(data.table) 
dt[, Cn_dependent := sum((lengthMB/sum(lengthMB)) * Cn_), 
    by = .(Chromosome, Cn, mCn)] 

得到這個:

> dt 
    Chromosome Start  End lengthMB  imba   log2 Cn mCn Cn_ Cn_dependent 
1:  chr1  0 8022945 8.023 0.02690512 -0.001671481 2 1 1.99  1.987601 
2:  chr1 8022945 9168284 1.145 0.03044178 0.000601976 2 1 2.00  1.987601 
3:  chr1 9168284 9598904 0.431   NA -0.024952441 2 1 1.91  1.987601 
4:  chr1 9598904 31392788 21.794 0.03601199 0.002151497 3 1 3.01  3.010000 
5:  chr2  0 8022930 8.023 0.02690512 -0.001671481 3 1 2.89  2.890000 
6:  chr2 8022930 9168284 1.145 0.03044178 0.000601976 2 1 1.87  1.882285 
7:  chr2 9168284 9598904 0.431   NA -0.024952441 2 1 1.57  1.882285 
8:  chr2 9598904 31392788 21.794 0.03601199 0.002151497 2 0 1.87  1.870000 
9:  chr2 31392788 35402000 1.164 0.02973377 0.003149921 2 1 2.01  1.882285 

要摺疊由ChromosomeCnmCn,你可以使用鍵和unique

> setkey(dt, "Chromosome", "Cn", "mCn") 
> unique(dt) 
    Chromosome Start  End lengthMB  imba   log2 Cn mCn Cn_ Cn_dependent 
1:  chr1  0 8022945 8.023 0.02690512 -0.001671481 2 1 1.99  1.987601 
2:  chr1 9598904 31392788 21.794 0.03601199 0.002151497 3 1 3.01  3.010000 
3:  chr2 9598904 31392788 21.794 0.03601199 0.002151497 2 0 1.87  1.870000 
4:  chr2 8022930 9168284 1.145 0.03044178 0.000601976 2 1 1.87  1.882285 
5:  chr2  0 8022930 8.023 0.02690512 -0.001671481 3 1 2.89  2.890000 

這裏是data.table我開始用的dput

> dput(dt) 
structure(list(Chromosome = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L), .Label = c("chr1", "chr2"), class = "factor"), Start = c(0L, 
8022945L, 9168284L, 9598904L, 0L, 8022930L, 9168284L, 9598904L, 
31392788L), End = c(8022945L, 9168284L, 9598904L, 31392788L, 
8022930L, 9168284L, 9598904L, 31392788L, 35402000L), lengthMB = c(8.023, 
1.145, 0.431, 21.794, 8.023, 1.145, 0.431, 21.794, 1.164), imba = c(0.026905119, 
0.030441784, NA, 0.036011994, 0.026905119, 0.030441784, NA, 0.036011994, 
0.029733771), log2 = c(-0.001671481, 0.000601976, -0.024952441, 
0.002151497, -0.001671481, 0.000601976, -0.024952441, 0.002151497, 
0.003149921), Cn = c(2L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 2L), mCn = c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L), Cn_ = c(1.99, 2, 1.91, 3.01, 
2.89, 1.87, 1.57, 1.87, 2.01)), .Names = c("Chromosome", "Start", 
"End", "lengthMB", "imba", "log2", "Cn", "mCn", "Cn_"), class = c("data.table", 
"data.frame"), row.names = c(NA, -9L), .internal.selfref = <pointer: 0x26abf68>)