2014-01-30 116 views
0

我與biopython有點爭鬥,我試圖根據3個標準​​過濾一組蛋白質序列: 1)該序列包含一個起始密碼子,在我的protein.fasta文件中表示爲M 2)該序列包含一個終止密碼子,表示爲* 3)M和*之間的長度至少爲我期望的長度的90%,這是一個新文件過濾器序列

這是我試圖做的,這些定義的條件只是在我的腦海裏一團糟,一點點的幫助纔會真正感激!

from Bio import SeqIO 

source = 'protein.fasta' 
outfile = 'filtered.fa' 
sub1 ='M' 
sub2 = '*' 
length = 'protein_length.txt' 

def seq_check(seq, sub1, sub2): 
# basically a function to check whether seq contains both M and *, and is of the expected length 

return 

seqs = SeqIO.parse(source, 'fasta') 
filtered = (seq for seq in seqs if seq_check(seq.seq, sub1, sub2, length)) 
SeqIO.write(filtered, outfile, 'fasta') 


Protein datafile: 
>comp12_c0_seq1:217-297 
SR*THDYAALLTSHRSLDLVYVYNVV 
>comp15_c0_seq1:3-197 
*LCI*SCIVRVWLRYPSP*LANYFPQM*RLSAIRLF*ERLIYGPFLC*NYF*S*PKIAVHTYRS 

Length datafile: 
comp12_c0_seq1 50 
comp15_c0_seq1 80 

感謝您的幫助 克萊爾

回答

1

如果你能肯定的是,蛋白質和長度的文件都以相同的順序,你可能需要對代碼進行修改,以使用字典是更多的內存高效的大型數據集,例如編寫一個生成函數,生成第二列轉換爲int,然後用SeqIO.parse()與itertools.izip()進行比較。

def read_lengths(path): 
    """Reads "length file" into dict mapping sequence ID to length.""" 
    lengths = {} 
    with open(path) as f: 
     for line in f: 
      seq, length = line.strip().split() 
      lengths[seq] = int(length) 
    return lengths 


def enclosed_substrings(s, start, stop): 
    """Find all substrings starting with `start` and ending with `stop`.""" 
    startpos = 0 
    stoppos = 0 
    while True: 
     startpos = s.find(start, startpos) 
     if startpos < 0: 
      break 
     stoppos = s.find(stop, startpos + 1) 
     if stoppos < 0: 
      break 
     yield s[startpos:stoppos + 1] 
     startpos += 1 


def seq_check(record, expected_lens, len_factor=0.9, start='M', stop='*'): 
    min_len = expected_lens[record.id] * len_factor 
    for sub in enclosed_substrings(record.seq, start, stop): 
     if len(sub) >= min_len: 
      return True 
    return False 


source_file = 'protein.fasta' 
out_file = 'filtered.fa' 
length_file = 'protein_length.txt' 

expected_lengths = read_lengths(length_file) 
seqs = SeqIO.parse(source_file, 'fasta') 
filtered = (seq for seq in seqs if seq_check(seq, expected_lengths)) 
SeqIO.write(filtered, out_file, 'fasta') 
+0

非常感謝你這正是我想要的,我沒有意識到這是複雜的。感謝您花時間幫助我解決問題 – user3188922

+0

我不認爲這很複雜。如果你願意,你也可以用're.findall()'代替'enclosed_substrings'。但是,http://xkcd.com/1171/ –