只是一個快速的警告:一個別名可以映射到多個Entrez基因ID。
因此,您當前的解決方案假定第一個列出的ID是正確的(這可能不是正確的)。
# e.g. The alias "A1B" is assumed to map to "1" and not "6641"
mget("A1B", org.Hs.egALIAS2EG)
# $A1B
# [1] "1" "6641"
如果您?org.Hs.egALIAS2EG
退房的幫助下,你會看到,這是從來沒有建議使用別名或符號作爲主要基因標識。
## From the 'Details' section of the help:
# Since gene symbols are sometimes redundantly assigned in the literature,
# users are cautioned that this map may produce multiple matching results
# for a single gene symbol. Users should map back from the entrez gene IDs
# produced to determine which result is the one they want when this happens.
# Because of this problem with redundant assigment of gene symbols,
# is it never advisable to use gene symbols as primary identifiers.
沒有人工調整,不可能知道哪個ID是「正確的」。因此,你的最簡單的辦法是讓所有可能的標識和符號爲表中的每個別名,同時保持相關信息哪些是受體,哪些是配體:
# your example subset with "A1B" and "trash" added for complexity
myTable <- data.frame(
ReceptorGene = c("A1B", "ACVR2B", "ACVR2B", "ACVR2B", "ACVR2B", "AMHR2", "BLR1", "BMPR1A", "BMPR1A", "BMPR1A", "BMPR1A", "BMPR1A"),
LigandGene = c("trash", "INHA", "INHBA", "INHBB", "INHBC", "AMH", "SCYB13", "BMP10", "BMP15", "BMP2", "BMP3", "BMP4"),
stringsAsFactors = FALSE
)
# unlist and rename
my.aliases <- unlist(myTable)
names(my.aliases) <- paste(names(my.aliases), my.aliases, sep = ".")
# determine which aliases have a corresponding Entrez Gene ID
has.key <- my.aliases %in% keys(org.Hs.egALIAS2EG)
# replace Aliases with character vectors of all possible entrez gene IDs
my.aliases[has.key] <- sapply(my.aliases[has.key], function(x) {
eg.ids <- unlist(mget(x, org.Hs.egALIAS2EG))
symbols <- unlist(mget(eg.ids, org.Hs.egSYMBOL))
})
# my.aliases retains all pertinent information regarding the original alias
my.aliases[1:3]
# $ReceptorGene1.A1B
# 1 6641
# "A1BG" "SNTB1"
#
# $ReceptorGene2.ACVR2B
# 93
# "ACVR2B"
#
# $ReceptorGene3.ACVR2B
# 93
# "ACVR2B"
一旦你知道哪些Entrez基因ID是適當的,您可以將它們作爲附加列存儲在桌面上。
myTable$receptor.id <- c("1", "93", "93", "93", "93", "269", "643", "657", "657", "657", "657", "657")
myTable$ligand.id <- c(NA, "3623", "3624", "3625", "3626", "268", "10563", "27302", "9210", "650", "651", "652")
然後,當您需要更新到最新的符號時,您可以使用Entrez基因ID(永遠不需要更新)。
has.key <- myTable$receptor.id %in% keys(org.Hs.egSYMBOL)
myTable$ReceptorGene[has.key] <- unlist(mget(myTable$receptor.id[has.key], org.Hs.egSYMBOL))
has.key <- myTable$ligand.id %in% keys(org.Hs.egSYMBOL)
myTable$LigandGene[has.key] <- unlist(mget(myTable$ligand.id[has.key], org.Hs.egSYMBOL))
head(myTable)
# ReceptorGene LigandGene receptor.id ligand.id
# 1 A1BG trash 1 <NA>
# 2 ACVR2B INHA 93 3623
# 3 ACVR2B INHBA 93 3624
# 4 ACVR2B INHBB 93 3625
# 5 ACVR2B INHBC 93 3626
# 6 AMHR2 AMH 269 268
謝謝,你的代碼非常優雅,但是它有什麼問題。在代碼片段#2中,'v < - v [%%ls(org.Hs.egALIAS2EG)]中的v%'返回具有0列和470行的數據幀。我對'%in%'不夠熟悉,不知道它有什麼問題。 – enricoferrero
ls(org.Hs.egALIAS2EG)應給出映射中鍵的名稱(有效別名)。如果第一個元素(v)在第二個元素中,'%in%'返回TRUE,如果不是,則返回FALSE。該操作不應返回數據幀。你可以請張貼一個示例表嗎? (如myTable [1:10,1:10])? – January
你是對的,但是如果你看看你把它作爲一個子集返回給v的代碼:'v < - v [%%(org.Hs.egALIAS2EG)]',它返回一個空的數據幀。自己嘗試,也許我想的東西:'>噸<-myTable [40:50,] >噸 ReceptorGene LigandGene 40 ACVR2B INHA 41 ACVR2B INHBA 42 ACVR2B INHBB 43 ACVR2B INHBC 44 AMHR2 AMH 45 BLR1 SCYB13 46 BMPR1A BMP10 47 BMPR1A BMP15 48 BMPR1A BMP2 49 BMPR1A BMP3 50 BMPR1A BMP4 ' – enricoferrero