從蛋白質數據庫到Cosmic或Uniprot的蛋白質序列比對

我想將蛋白質數據庫中的PDB文件與Cosmic或Uniprot中顯示的蛋白質的典型AA序列相匹配。具體來說，我需要做的是從pdb文件中拉出主幹中的碳原子和它們的xyz位置。我還需要在蛋白質序列中提取他們的實際順序。對於結構3GFT（克拉斯 - Uniprot登錄號P01116），這很容易，我可以採用ResSeq編號。但是，對於其他一些蛋白質，我無法弄清楚這是可能的。例如，對於結構（2ZHQ）（蛋白質F2- Uniprot登錄號P00734），Seqres具有針對數字「1」和「14」重複的ResSeq數字，並且僅在Icode條目中有所不同。進一步icode條目不是在詞法順序，因此很難說出要提取的順序。從蛋白質數據庫到Cosmic或Uniprot的蛋白質序列比對

如果考慮結構3V5Q（Uniprot登錄號Q16288），情況會更糟糕。對於大多數蛋白質，ResSeq編號與來自COSMIC或UNIPROT等來源的實際氨基酸相匹配。在位置711之後，它跳到位置730.當查看REMARK 465（缺失的原子）時，它顯示對於鏈A，缺失了726-729。然而，匹配到蛋白質後，那些AA實際上是712-715。

我附上了一些簡單的3GFT示例代碼，但如果有人是pdb文件的專家，並且可以幫助我解決其餘的問題，我會非常感激。

library(gdata) 

#answer<- get.positions("http://www.pdb.org/pdb/files/2ZHQ.pdb","L") 
answer<- get.positions("http://www.pdb.org/pdb/files/3GFT.pdb","A") 


#This function reads a pdb file and returns the appropriate data structure 
get.positions <- function(sourcefile, chain_required = "A"){ 
N <- 10^5 
AACount <- 0 
positions = data.frame(Residue=rep(NA, N),AtomCount=rep(0, N),SideChain=rep(NA, N),XCoord=rep(0, N),YCoord=rep(0, N),ZCoord=rep(0, N),stringsAsFactors=FALSE)  

visited = list() 
filedata <- readLines(sourcefile, n= -1) 
for(i in 1: length(filedata)){ 
input = filedata[i] 
id = substr(input,1,4) 
if(id == "ATOM"){ 
    type = substr(input,14,15) 
    if(type == "CA"){ 
    resSerial = strtoi(substr(input, 7,11)) 
    residue = substr(input,18,20) 
    type_of_chain = substr(input,22,22) 
    resSeq = strtoi(substr(input, 23,26)) 
    altLoc = substr(input,17,17) 

    if(resSeq >=1){ #does not include negative residues 
     if(type_of_chain == chain_required && !(resSerial %in% visited) && (altLoc == " " || altLoc == "A")){ 
     visited <- c(visited, resSerial) 
     AACount <- AACount + 1 
     position_string =list() 
     position_string[[1]]= as.numeric(substr(input,31,38)) 
     position_string[[2]] =as.numeric(substr(input,39,46)) 
     position_string[[3]] =as.numeric(substr(input,47,54)) 
     #print(input) 
     positions[AACount,]<- c(residue, resSeq, type_of_chain, position_string[[1]], position_string[[2]], position_string[[3]]) 
     } 
    } 
    } 
} 

    } 
    positions<-positions[1:AACount,] 
    positions[,2]<- as.numeric(positions[,2]) 
    positions[,4]<- as.numeric(positions[,4]) 
    positions[,5]<- as.numeric(positions[,5]) 
    positions[,6]<- as.numeric(positions[,6]) 
    return (positions) 
}

來源

2012-05-28 user1357015

寫作時你可能想這個問題轉移到www.biostars.org和寫入[email protected]（你知道，這些序列在數據庫級別就足夠了聯繫？）在任何情況下， [email protected]請求Jules Jacobsen，因爲他是UniProt常駐專家，負責將PDB結構鏈接到uniprot.org規範序列。

來源

2012-05-29 16:34:23 Jerven

嗨，感謝您的建議！ – user1357015

我已經在biostars上發佈了這個問題，謝謝！我還通過電子郵件發送了PDB。與一位教授談話雖然專門處理蛋白質結構，但他提到最好的方法是做一個對齊。不過，謝謝你的幫助！ – user1357015

這是一種方法。它需要bio3d R軟件包（http://thegrantlab.org/bio3d/），並安裝肌肉對齊可執行文件。我按照這裏的'其他實用程序'的說明http://thegrantlab.org/bio3d/tutorials/installing-bio3d

下面的示例代碼似乎適用於您列出的三種情況。

library(bio3d) 

## Download PDB file with given 'id' (can also read from online) 
id <- "3GFT" #"3V5Q" 
file.name <- get.pdb(id) 
pdb <- read.pdb(file.name) 
pdb.seq <- pdbseq(pdb, atom.select(pdb, chain="A", elety="CA")) 

## Get UniProt identifier and sequence 
pdb.ano <- pdb.annotate(id, "db_id") 
uni.seq <- get.seq(pdb.ano) 

## Align sequences to define corespondences 
aln <- seqaln(seqbind(pdb.seq, uni.seq), id=c(file.name, unlist(pdb.ano))) 

## Read aligned coordinate data (all the info you want is in here) 
pdbs <- read.fasta.pdb(aln) 

answer2 <- cbind(1:ncol(pdbs$ali), t(pdbs$ali), 
      pdbs$resno[1,], matrix(pdbs$xyz[1,], ncol=3, byrow=T)) 

head(answer2) 

[1,] "1" "M"  "M" "1" "62.935" "97.579" "30.223" 
[2,] "2" "T"  "T" "2" "63.155" "95.525" "27.079" 
[3,] "3" "E"  "E" "3" "65.289" "96.895" "24.308" 
[4,] "4" "Y"  "Y" "4" "64.899" "96.22" "20.615" 
[5,] "5" "K"  "K" "5" "67.593" "96.715" "18.023" 
[6,] "6" "L"  "L" "6" "65.898" "97.863" "14.816"

，如果你希望你的氨基酸3字母代碼上市有一個在bio3d一個aa321（）函數。

來源

2014-05-15 01:30:16 baz

我低估了這個答案，因爲雖然技術上代碼明智，但它是健全的。我認爲它忽略了一些可能困擾這種對比的問題，這在嘗試發現新的生物學知識時很重要。例如http://www.ebi.ac.uk/pdbe/widgets/unipdb?uniprot=P01116可以很好地概述現有的對齊方式。 – Jerven

從蛋白質數據庫到Cosmic或Uniprot的蛋白質序列比對

回答

相關問題