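Find all (k, d)-motifs in a collection of DNA strings

Basically, the problem asks for all possible motifs (k-mers) that occur with at most d mismatches in a collection of DNA strings. I was able to write the code below to find all (k, d)-motifs for a single DNA string, but I have no idea how to modify it when the input is several lines of DNA strings.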
Sample input:
k = 3, d = 1
ATTTGGC
TGCCTTA
CGGTATC
GAAAATT
Sample output:
ATA
ATT
GTT
TTT
import collections

kmer = 5
in_genome = "GGGGCTTCACAGCGCCCCTACAATACAATAGCCCTCGAATACCTACTTGCCACTATGTTCGGCGTCATTACATACGACCCGCATGCTCGGCAGTATGTCTCTACTCAGGATCCCTCAATATTACTTACGCCAATATGTCTAAGGTTTAGA"
in_mistake = 1
out_result = []
mismatch_list = []


def hamming_distance(s1, s2):
    # Return the Hamming distance between equal-length sequences
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    else:
        return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))


# Collect every k-mer (substring of length kmer) of the genome
for i in range(len(in_genome) - kmer + 1):
    v = in_genome[i:i + kmer]
    out_result.append(v)

# For each distinct k-mer, record one hit per k-mer within in_mistake mismatches
for t_kmer in set(out_result):
    for s_kmer in out_result:
        if hamming_distance(t_kmer, s_kmer) <= in_mistake:
            mismatch_list.append(t_kmer)

mismatch_count = collections.Counter(mismatch_list)
print(mismatch_count)
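For the multi-string case, one possible direction is sketched below. It assumes that a (k, d)-motif is a k-mer that appears in every string of the collection with at most d mismatches, which is what the sample output suggests. It is not taken from the code above: the helper names `appears_with_mismatches` and `motif_enumeration` are made up for illustration, and the search is a plain brute force over all 4**k candidate k-mers, so it is only practical for small k.

```python
from itertools import product


def hamming_distance(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))


def appears_with_mismatches(pattern, text, d):
    """Hypothetical helper: True if some window of text is within
    d mismatches of pattern."""
    k = len(pattern)
    return any(
        hamming_distance(pattern, text[i:i + k]) <= d
        for i in range(len(text) - k + 1)
    )


def motif_enumeration(dna, k, d):
    """Hypothetical helper: list every k-mer that occurs in each string
    of dna with at most d mismatches.

    Assumption: candidates are all 4**k strings over the alphabet ACGT,
    generated in lexicographic order (brute force).
    """
    motifs = []
    for letters in product("ACGT", repeat=k):
        candidate = "".join(letters)
        if all(appears_with_mismatches(candidate, text, d) for text in dna):
            motifs.append(candidate)
    return motifs


dna = ["ATTTGGC", "TGCCTTA", "CGGTATC", "GAAAATT"]
print(" ".join(motif_enumeration(dna, 3, 1)))
```

On the sample input this should print the four motifs from the sample output (ATA ATT GTT TTT). For larger k, a common refinement is to generate candidates only from the d-neighbourhoods of the k-mers in the first string instead of trying all 4**k strings.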
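What is the question, please? – Aprillion
Could you elaborate on what 'd' means? Define a mismatch. – Pynchia
You could concatenate all of those lines into the string in_genome –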