過濾器序列

我與biopython有點爭鬥，我試圖根據3個標準過濾一組蛋白質序列： 1）該序列包含一個起始密碼子，在我的protein.fasta文件中表示爲M 2）該序列包含一個終止密碼子，表示爲* 3）M和*之間的長度至少爲我期望的長度的90％，這是一個新文件過濾器序列

這是我試圖做的，這些定義的條件只是在我的腦海裏一團糟，一點點的幫助纔會真正感激！

from Bio import SeqIO 

source = 'protein.fasta' 
outfile = 'filtered.fa' 
sub1 ='M' 
sub2 = '*' 
length = 'protein_length.txt' 

def seq_check(seq, sub1, sub2): 
# basically a function to check whether seq contains both M and *, and is of the expected length 

return 

seqs = SeqIO.parse(source, 'fasta') 
filtered = (seq for seq in seqs if seq_check(seq.seq, sub1, sub2, length)) 
SeqIO.write(filtered, outfile, 'fasta') 


Protein datafile: 
>comp12_c0_seq1:217-297 
SR*THDYAALLTSHRSLDLVYVYNVV 
>comp15_c0_seq1:3-197 
*LCI*SCIVRVWLRYPSP*LANYFPQM*RLSAIRLF*ERLIYGPFLC*NYF*S*PKIAVHTYRS 

Length datafile: 
comp12_c0_seq1 50 
comp15_c0_seq1 80

感謝您的幫助克萊爾

來源

2014-01-30 user3188922

如果你能肯定的是，蛋白質和長度的文件都以相同的順序，你可能需要對代碼進行修改，以使用字典是更多的內存高效的大型數據集，例如編寫一個生成函數，生成第二列轉換爲int，然後用SeqIO.parse（）與itertools.izip（）進行比較。

def read_lengths(path): 
    """Reads "length file" into dict mapping sequence ID to length.""" 
    lengths = {} 
    with open(path) as f: 
     for line in f: 
      seq, length = line.strip().split() 
      lengths[seq] = int(length) 
    return lengths 


def enclosed_substrings(s, start, stop): 
    """Find all substrings starting with `start` and ending with `stop`.""" 
    startpos = 0 
    stoppos = 0 
    while True: 
     startpos = s.find(start, startpos) 
     if startpos < 0: 
      break 
     stoppos = s.find(stop, startpos + 1) 
     if stoppos < 0: 
      break 
     yield s[startpos:stoppos + 1] 
     startpos += 1 


def seq_check(record, expected_lens, len_factor=0.9, start='M', stop='*'): 
    min_len = expected_lens[record.id] * len_factor 
    for sub in enclosed_substrings(record.seq, start, stop): 
     if len(sub) >= min_len: 
      return True 
    return False 


source_file = 'protein.fasta' 
out_file = 'filtered.fa' 
length_file = 'protein_length.txt' 

expected_lengths = read_lengths(length_file) 
seqs = SeqIO.parse(source_file, 'fasta') 
filtered = (seq for seq in seqs if seq_check(seq, expected_lengths)) 
SeqIO.write(filtered, out_file, 'fasta')

來源

2014-01-30 08:28:10

非常感謝你這正是我想要的，我沒有意識到這是複雜的。感謝您花時間幫助我解決問題 – user3188922

我不認爲這很複雜。如果你願意，你也可以用're.findall（）'代替'enclosed_substrings'。但是，http://xkcd.com/1171/ –

回答

相關問題