如何確定哪些特徵位於基因組的特定位置

我有興趣確定哪些特徵（即基因/ cds）位於基因組的特定位置。例如，哪個基因（如果有的話）包含2,000,000個位置。我知道如何通過for循環完成此操作，並循環遍歷基因組中的每個功能（下面包含的代碼），但這是我想要做數百萬次隨機化研究的一部分，這將花費比我想要的要長得多。如何確定哪些特徵位於基因組的特定位置

代碼下面包括了什麼，我試圖做更具體的例子：

from Bio import SeqIO 
import random 
GenomeSeq = SeqIO.read(open("reference_sequence.gbk", "r"), "genbank") 

interesting_position = random.randint(0, len(GenomeSeq)) 

for feature in GenomeSeq.features: # loop each position through whole genome 
    # In this particular case I'm interested in focusing on cds, but 
    # in others, I may be interested in other feature types? 
    if feature.type == "CDS": 
     if (feature.location._start.position <= interesting_position and 
      feature.location._end.position >= interesting_position): 
      try: 
       print feature.qualifiers['gene'] 
      except KeyError: 
       print feature

我想過做一個字典，對應於按鍵的基因內的每個位置，功能ID作爲值，作爲查找會比循環快，但它只是好像應該有辦法做到GenomeSeq[interestion_position].qualifiers['gene']

來源

2013-07-25 ded

類似'GenomeSeq [interesting_position] .features（）'，也許？ – verbsintransit

@verbsintransit是的，這將是偉大的，但它似乎沒有工作，我得到一個屬性錯誤（'str'對象沒有屬性'功能'）。這是否有用，或只是你想看到工作的東西？ – ded

我從來沒有用過BioPython，但我發現這個位於它的文檔中：http://biopython.org/wiki/SeqIO

from Bio import SeqIO 
handle = open("example.fasta", "rU") 
records = list(SeqIO.parse(handle, "fasta")) 
handle.close() 
print records[0].id #first record 
print records[-1].id #last record

這是你在找什麼？

來源

2013-07-25 21:06:57 sihrc

根據它們在fasta文件中的順序訪問不同的序列條目，而不是根據隨機位置確定特定的基因。 – ded

相當古老，但我會試一試。讓我們假設你要檢查點的隨機列表對於給定的基因組（和一些基因組不是一個固定的點的集合）：

from Bio import SeqIO 
import random 

GenomeSeq = SeqIO.read(open("sequence.gb", "r"), "genbank") 

# Here I generate a lot of random points in the genome in a set 
interesting_positions = set(random.sample(xrange(1, len(GenomeSeq)), 20000)) 


for feature in GenomeSeq.features: 

    if feature.type == "CDS": 
     # Here I create a set for every position covered by a feature 
     feature_span = set(xrange(feature.location._start.position, 
            feature.location._end.position)) 

     # And this is the "trick": check if the two sets overlaps: if any point 
     # is in the interesting_positions AND in the points covered by this 
     # feature, we are interested in the feature. 
     if feature_span & interesting_positions: 
      try: 
       print feature.qualifiers['gene'] 
      except KeyError: 
       print feature

對於20.000分和4.7MB的基因組中環約需在我的crapy-2003-計算機中3秒，200,000個隨機點5秒。

分析和一點點重構之後，我提取只需要計算一次去這個的所有行：

這會消耗每樣本+4秒讀0.2秒左右/準備基因組。在我的這臺舊機器上使用大約50個小時完成1,000,000份樣品。作爲一個隨機抽樣，在多色機器中啓動一些流程，明天就完成了。

來源

2014-08-04 19:47:49 xbello

感謝您的迴應，這實際上是我仍未解決的問題。看來這實際上是一個相同的問題：不得不逐步通過基因組特徵，每個隨機組1次。 200個隨機點是合理的，但理想情況下，我想至少重複一次100萬次（1388小時）。 – ded

@ded我給答案添加了一個最快的代碼。讓我知道它是否適合。 – xbello

如何確定哪些特徵位於基因組的特定位置

回答

相關問題