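Using Biopython to find and extract FASTA records that match an exact DNA sequence
I am trying to use Biopython to extract, from a FASTA file, every DNA sequence that contains the following short DNA sequence: "GGCTCAACCCTGGA"
Here is what I have so far: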
from Bio import SeqIO
source = "rep_set_no_spaces.fasta"
outfile = "rep_set_PNA_matches.fasta"
seq1 = "GGCTCAACCCTGGA"
# basically a function to check whether seq contains seq1
def seq_check(seq, seq1):
    return seq.find(seq1)
seqs = SeqIO.parse(source, 'fasta')
filtered = (seq for seq in seqs if seq_check(seq.seq, seq1))
SeqIO.write(filtered, outfile, 'fasta')
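One thing worth checking in the snippet above: Biopython's `Seq.find`, like `str.find`, returns the index of the first match, or -1 when there is no match. Because -1 is truthy and 0 is falsy, testing the return value directly keeps records that do not contain the motif and drops records where the motif starts at position 0. The sketch below shows this with plain strings; the three short sequences and their names are made up for illustration and are not taken from the file.

```python
motif = "GGCTCAACCCTGGA"

# Hypothetical toy sequences, for illustration only.
examples = {
    "motif_in_middle": "TTTGGCTCAACCCTGGAAAA",
    "motif_at_start": "GGCTCAACCCTGGAAAA",
    "no_motif": "TTTGGCTCAACCTGGGAAAA",
}

# Truthiness of find(), which is what the original filter tests:
# -1 (not found) counts as True, 0 (found at the start) counts as False.
kept_by_find = [name for name, s in examples.items() if s.find(motif)]

# Comparing against -1 keeps exactly the sequences that contain the motif.
kept_by_compare = [name for name, s in examples.items() if s.find(motif) != -1]

print(kept_by_find)
print(kept_by_compare)
```

In this toy case the truthiness test keeps the sequence without the motif and loses the one that begins with it, while the explicit comparison keeps both real matches.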
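I tried to adapt the code from this post (Filtering a FASTA file based on sequence with BioPython), but the sequence I am interested in is neither at the beginning nor at the end of the sequences...
For example, here are some of my sequences... The first and fourth sequences match, but the second and third do not. I want to pull out the sequences and write a new FASTA file that holds only those containing the "GGCTCAACCCTGGA" sequence: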
>110148arco.1D_184193
TACGGAGGGGGTTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGCACGTAGGTGGATTGGAAAGTATGGGGTGAAATCCCAGGGCTCAACCCTGGAACTGCCTCATAAACTATCAGTCTAGAGTTCGAGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGCTCGATACTGACACTGAGGTGCGAAAGTGTGGGGAGCAAACAGG
>110475arco.1D_40770
TACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTGTTAAGTCAGCTGTGAAAGCCCTGGGCTCAACCTGGGAATTGCAGTTGATACTGGCAAGCTGGAGTACGAGAGAGGGAGGTAGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAATACCAGTGGCGAAGGCGGCCTCCTGGCTCGATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG
>110484arco.1D_190999
TACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGCGGTTTGTTAAGTCAGCTGTGAAAGCCCTGGGCTCAACCTGGGAATTGCAGTTGATACTGATCGACTAGAGTACGAGAGAGGGAGGTAGAATTCCACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAATACCGGTGGCGAAGGCGGCCTCCTGGCTCGATACTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGG
>110525amin.3D_40107
TACGGAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGTACGTAGGCGGATTAGTAAGTAAGATGTGAAATCCCAGGGCTCAACCCTGGAACTGCATTTTAAACTGCTAGTCTAGAGTTATGGAGAGGTAAGTGGAATTCCTAGTGTAGAGGTGAAATTCGTAGATATTAGGAGGAACACCAGAGGCGAAGGCGACTTACTGGACATATACTGACGCTGAGGTACGAAAGTGTGGGTAGCAAACAGG
Thanks!
More succinct: 'def seq_check(seq, seq1): return seq1 in seq' – BioGeek
Thanks for your answer! This works perfectly :) –
@Brooke_W If the answer solved your problem, you should accept the answer – Markus
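BioGeek's one-line suggestion could be turned into a complete script along the following lines. This is only a sketch under some assumptions: `filter_records` and `write_matches` are names invented here, and the Biopython import sits inside the function purely so that the two helpers can be tried without Biopython installed. The `in` test is an exact, case-sensitive match on the strand as written in the file.

```python
def seq_check(seq, seq1):
    """Return True when seq1 occurs anywhere in seq."""
    return seq1 in seq


def filter_records(records, motif):
    """Lazily yield only the records whose .seq contains motif."""
    return (rec for rec in records if seq_check(rec.seq, motif))


def write_matches(source, outfile, motif):
    """Copy matching FASTA records from source to outfile; return the count."""
    # Imported here (an illustrative choice) so the helpers above
    # can be exercised without Biopython installed.
    from Bio import SeqIO

    records = SeqIO.parse(source, "fasta")
    return SeqIO.write(filter_records(records, motif), outfile, "fasta")
```

With the file names from the question, the call would be `write_matches("rep_set_no_spaces.fasta", "rep_set_PNA_matches.fasta", "GGCTCAACCCTGGA")`.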