2013-06-06 22 views

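Selecting elements from a non-symmetric matrix

I have a list of proteins and their interactors, and I am interested in finding the percentage of interactors shared between different proteins.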

My list of proteins and interactors looks like this:

head(lista) 
$`A1CF ` 
[1] " A1CF" " APOBEC1" " CUGBP2" " KHSRP" " SYNCRIP" " TNPO2" 

$`A2LD1 ` 
[1] " A2LD1" " PRPSAP2" " RPL15" " TANC1" 

$`A2M ` 
[1] " A2M"  " ADAM19" " ADAMTS1" " AMBP"  " ANXA6" " APOE"  " APP"  " B2M"  " C11orf58" " CELA1" " CPB2"  " CTSB"  " CTSE"  
[14] " F2"  " HSPA5" " IL10"  " IL1B"  " KLK13" " KLK2"  " KLK3"  " KLK5"  " KLKB1" " LCAT"  " LEP"  " LRP1"  " MMP2"  
[27] " MYOC"  " NGF"  " PAEP"  " PDGFA" " PDGFB" " PLG"  " SERPINA1" " SHBG"  " SPACA3" " TGFBI" 

$`AAAS ` 
[1] " AAAS" " ARHGAP1" " BANF1" " CCNG2" " EP300" " HMGA1" " KPNB1" " NUP107" " NUP133" " NUP153" " NUP155" " NUP160" " NUP188" " NUP205" 
[15] " NUP210" " NUP214" " NUP35" " NUP37" " NUP43" " NUP50" " NUP54" " NUP62" " NUP85" " NUP88" " NUP93" " NUP98" " NUPL1" " NUPL2" 
[29] " PLK4" " POM121C" " PSIP1" " RAE1" " RAN"  " RANBP2" " SEH1L" " TARDBP" " TPR"  " TTK"  " XPO1" 

$`AAGAB ` 
[1] " AAGAB" " AFTPH" " EIF3C" " UNC119" 

$`AAK1 ` 
[1] " AAK1"  " ACOX3" " ADAM28" " ALPK3" " AURKB" " AZI2"  " BMP2K" " CABC1" " CAMK2G" " DCK"  " DCTPP1" " EIF2AK1" " FAM83A" 
[14] " FER"  " FRYL"  " GAPVD1" " GFPT1" " HIPK1" " JAK1"  " KIAA0195" " KIAA0528" " LIMK2" " LSM14A" " MAP4K2" " MAP4K5" " MAPK6" 
[27] " NEK11" " NQO2"  " NUMB"  " PDE4A" " PIP4K2C" " PKN3"  " PRKAA1" " PTPN18" " SIK2"  " SIK3"  " SPEG"  " TAOK1" " TAOK3" 
[40] " TBK1"  " TBKBP1" " TESK2" " TMX1"  " TNK1"  " ZAK" 

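To get the percentage of interactors shared between proteins, I did the following:

I created a square matrix whose dimensions equal the length of lista: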

M=matrix(); 
length(M) = 9794^2; 
dim(M) = c(9794, 9794); 

#A function to calculate the interactors shared among proteins 
dFun3 <- function(x,y){length(which(x%in%y))/length(x)}; 

#To create a matrix with the percentage of interactors shared among proteins (note that the matrix is non-symmetric: AxB differs from BxA, with A and B being proteins) 

for (i in 1:length(lista)) 
{ 
    for (j in 1:length(lista)) 
    { 
     k = dFun3(lista[[i]], lista[[j]]) 
     M[i,j] = k; 
    } 
} 

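Now I have a matrix showing the comparison between AxB and BxA. What I want to do next is compare the value for protein i against protein j with the value for protein j against protein i. The idea is to compare AxB with BxA and, if AxB > 0.7 and BxA < 0.7, remove protein A. My approach is a loop like this: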

for (i in 1:nrow(M)) 
{ 
    for (j in 1:ncol(M)) 
    { 
     if (x[i,] > 0.7 & x[,j] < 0.7) {x[i,] <- "-1"} 
     if (x[,j] > 0.7 & x[i,] <0.7) {x[,j] <- "+1"} 
    } 
} 

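With this approach I intend to remove the proteins marked +1 or -1 in the comparison.

However, this approach takes a very long time... any suggestion would be greatly appreciated.

Thanks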

Answers


It looks like combn + intersect is a better option here. Try this, for example:

combn(seq_along(lista),2,function(x) 
     length(intersect(lista[[x[1]]],lista[[x[2]]]))/length(lista[[x[1]]])) 

[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ## gives all zeros here since 
            ## no intersection in your example 

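In effect, combn generates every possible combination of two indices and passes each pair of indices to the function that tests the intersection.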

combn(seq_along(lista),2) 
    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] 
[1,] 1 1 1 1 1 2 2 2 2  3  3  3  4  4  5 
[2,] 2 3 4 5 6 3 4 5 6  4  5  6  5  6  6 
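To see what the function returns when the sets do overlap, here is a minimal sketch on a made-up list (`toy` and the names `P1` to `P3` are hypothetical, not taken from the question's data):

```r
# Hypothetical toy data: three "proteins" with their interactor sets
toy <- list(P1 = c("a", "b", "c", "d"),
            P2 = c("a", "b"),
            P3 = c("x", "y"))

# For each pair (i, j) with i < j: fraction of i's interactors also found in j
shared <- combn(seq_along(toy), 2, function(x)
  length(intersect(toy[[x[1]]], toy[[x[2]]])) / length(toy[[x[1]]]))

# Label each value with the pair it belongs to
names(shared) <- combn(names(toy), 2, paste, collapse = " vs ")
shared
```

Note that only one direction is computed for each pair: P1 vs P2 gives 0.5 here, while the reverse ratio (P2 vs P1) would be 1, because both of P2's interactors are in P1.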

Edit: adding the dput of lista, since the OP did not give a reproducible example:

dput(lista) 
structure(list(A1CF = c(" A1CF", " AURKB", " CUGBP2", " KHSRP", 
" SYNCRIP", " TNPO2"), A2LD1 = c(" A2LD1", " PRPSAP2", " RPL15", 
" TANC1"), A2M = c(" A2M", " ADAM19", " ADAMTS1", " AMBP", " ANXA6", 
" APOE", " APP", " B2M", " C11orf58", " CELA1", " CPB2", " CTSB", 
" CTSE", " F2", " HSPA5", " IL10", " IL1B", " KLK13", " KLK2", 
" KLK3", " KLK5", " KLKB1", " LCAT", " LEP", " LRP1", " MMP2", 
" MYOC", " NGF", " PAEP", " PDGFA", " PDGFB", " PLG", " SERPINA1", 
" SHBG", " SPACA3", " TGFBI"), AAAS = c(" AAAS", " ARHGAP1", 
" BANF1", " CCNG2", " EP300", " HMGA1", " KPNB1", " NUP107", 
" NUP133", " NUP153", " NUP155", " NUP160", " NUP188", " NUP205", 
" NUP210", " NUP214", " NUP35", " NUP37", " NUP43", " NUP50", 
" NUP54", " NUP62", " NUP85", " NUP88", " NUP93", " NUP98", " NUPL1", 
" NUPL2", " PLK4", " POM121C", " PSIP1", " RAE1", " RAN", " RANBP2", 
" SEH1L", " TARDBP", " TPR", " TTK", " XPO1"), AAGAB = c(" AAGAB", 
" AFTPH", " EIF3C", " UNC119"), AAK1 = c(" AAK1", " ACOX3", " ADAM28", 
" ALPK3", " AURKB", " AZI2", " BMP2K", " CABC1", " CAMK2G", " DCK", 
" DCTPP1", " EIF2AK1", " FAM83A", " FER", " FRYL", " GAPVD1", 
" GFPT1", " HIPK1", " JAK1", " KIAA0195", " KIAA0528", " LIMK2", 
" LSM14A", " MAP4K2", " MAP4K5", " MAPK6", " NEK11", " NQO2", 
" NUMB", " PDE4A", " PIP4K2C", " PKN3", " PRKAA1", " PTPN18", 
" SIK2", " SIK3", " SPEG", " TAOK1", " TAOK3", " TBK1", " TBKBP1", 
" TESK2", " TMX1", " TNK1", " ZAK")), .Names = c("A1CF", "A2LD1", 
"A2M", "AAAS", "AAGAB", "AAK1")) 

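Edit

To compare in both directions (row 1 against row 2 and row 2 against row 1 of the matrix), you can change the function like this: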

ll <- combn(seq_along(lista),2,FUN=function(x){ 
    ## shared interactors as a fraction of each protein's own set: 
    ## ratio[1] is relative to the first protein, ratio[2] to the second 
    ratio <- length(intersect(lista[[x[1]]],lista[[x[2]]]))/ 
     c(length(lista[[x[1]]]),length(lista[[x[2]]])) 
    res <- NA_integer_     ## value to return by default 
    if (ratio[1] > 0.7 && ratio[2] < 0.7) 
     res <- x[[1]]      ## return the index of the first protein 
    if (ratio[2] > 0.7 && ratio[1] < 0.7) 
     res <- x[[2]]      ## return the index of the second protein 
    res 
}) 
## to get the names of the proteins to remove 
names(lista)[ll[!is.na(ll)]] 
## to remove those proteins from the original list 
lista[!names(lista) %in% names(lista)[ll[!is.na(ll)]]] 

You should probably also remove the duplicates from ll.
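A minimal sketch of that de-duplication step, assuming `ll` holds protein indices with `NA` wherever no protein was flagged (the values and the list `toy` below are made up for illustration):

```r
# Hypothetical result of the combn() call: NA means "no protein flagged"
ll <- c(NA, 2, NA, 2, 5, NA, 5, 2)

# Drop the NAs, then keep each flagged index only once
to_remove <- unique(ll[!is.na(ll)])
to_remove

# Hypothetical list standing in for lista
toy <- list(A = 1, B = 2, C = 3, D = 4, E = 5)

# setdiff() stays correct even when nothing was flagged,
# whereas toy[-to_remove] would return an empty list in that case
kept <- toy[setdiff(seq_along(toy), to_remove)]
names(kept)
```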

FYI, 47956321 = choose(9794, 2) is the number of combinations....
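If the full matrix `M` from the question is already available, the two directions can also be compared without an explicit loop by setting `M` against its transpose. This is only a sketch on a made-up list (`toy` is hypothetical), and it assumes enough memory: a 9794 x 9794 numeric matrix alone takes roughly 0.75 GB, and the logical comparisons need additional temporary matrices.

```r
# Hypothetical toy data (not the question's list)
toy <- list(P1 = c("a", "b", "c", "d"),
            P2 = c("a", "b"),
            P3 = c("a", "b", "c", "d", "e"))

# M[i, j] = fraction of protein i's interactors that are also in protein j
# (outer sapply builds the columns, inner sapply builds the rows)
M <- sapply(toy, function(y) sapply(toy, function(x) mean(x %in% y)))

# Flag protein i when M[i, j] > 0.7 while the reverse ratio M[j, i] < 0.7
flag <- M > 0.7 & t(M) < 0.7
to_remove <- rownames(M)[rowSums(flag) > 0]
to_remove
```

Here only `P2` is flagged: all of its interactors are in `P1` (ratio 1), while only half of `P1`'s interactors are in `P2` (ratio 0.5).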


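@noah I don't think such a command exists. I used RStudio to process the data: 1 - copy and paste the data from the history, 2 - use Ctrl + F to replace or insert the commas between elements (with some regular expressions). Formatting this was harder than answering the OP's question :). – agstudy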


@noah, you are right, I also ... in the matrix – user2380782


@noah Thanks, it should compare row2 and row1. I have edited my answer. – agstudy