2017-09-19 51 views
0

我想從fasta文件中計數kmers。我有以下腳本:Python:從fasta文件計數kmers

import operator 

seq = open('file', 'r') 
kmers = {} 
k = 5 
for i in range(len(seq) - k + 1): 
    kmer = seq[i:i+k] 
    if kmer in kmers: 
     kmers[kmer] += 1 
    else: 
     kmers[kmer] = 1 

for kmer, count in kmers.items(): 
    print (kmer + "\t" + str(count)) 

sortedKmer = sorted(kmers.items(), key=itemgetter(1), reverse=True) 
for item in sortedKmer: 
    print (item[0] + "\t" + str(item[1])) 

這對於只有一個序列的文件工作正常,但現在我有幾個重疊羣一個FASTA文件。 我FASTA文件看起來像這樣:

>1 
GTCTTCCGGCGAGCGGGCTTTTCACCCGCTTTATCGTTACTTATGTCAGCATTCGCACTT 
CTGATACCTCCAGCAACCCTCACAGGCCACCTTCGCAGGCTTACAGAACGCTCCCCTACC 
CAACAACGCATAAACGTCGCTGCCGCAGCTTCGGTGCATGGTTTAGCCCCGTTACATCTT 
CCGCGCAGGCCGACTCGACCAGTGAGCTATTACGCTTTCTTTAAATGATGGCTGCTTCTA 
AGCCAACATCCTGGCTGTCTGG 
>2 
AAAGAAAGCGTAATAGCTCACTGGTCGAGTCGGCCTGCGCGGAAGATGTAACGGGGCTAA 
ACCATGCACCGAAGCTGCGGCAGCGACACTCAGGTGTTGTTGGGTAGGGGAGCGTTCTGT 
AAGCCTGTGAAGGTGGCCTGTGAGGGTTGCTGGAGGTATCAGAAGTGCGAATGCTGACAT 
AAGTAACGATAAAGCGGGTGAAAAGCCCGCTCGCCGGAAGACCAAGGGTTCCTGTCCAAC 
GTTAATCGGGGCAGG 

我如何更改腳本,它首先採取「> 1」後的順序,打印輸出,進入「1> 2」,打印輸出等?

+1

它可能存在的的圖書館閱讀FASTA文件。 Google:http://biopython.org/wiki/SeqIO – Benjamin

+0

根據[FASTA文件格式](http://bioperl.org/formats/sequence_formats/FASTA_sequence_format)'>之後的描述行是完全自由格式的,所以很難給出通用的解決方案。基本的pythonic方法是使用regex例如re.split('> \ d',s) –

回答

0

我從來沒有聽說過kmer或fasta,但我想我明白你在做什麼。

您可以嘗試在涉及'>'的正則表達式上進行分割,但是我建議在到達'> 1'行後適當地打印文件之前,逐行處理文件並累積kmers。請參見下面的代碼與意見

import operator 

def printSeq(name, seq): 
    # Extract your code into a function and print header for current kmer 
    print("%s\n################################" %name) 
    kmers = {} 
    k = 5 
    for i in range(len(seq) - k + 1): 
     kmer = seq[i:i+k] 
     if kmer in kmers: 
      kmers[kmer] += 1 
     else: 
      kmers[kmer] = 1 

    for kmer, count in kmers.items(): 
     print (kmer + "\t" + str(count)) 

    sortedKmer = sorted(kmers.items(), reverse=True) 

    for item in sortedKmer: 
     print (item[0] + "\t" + str(item[1])) 

with open('file', 'r') as f: 
    seq = "" 
    key = "" 
    for line in f.readlines(): 
     # Loop over lines in file 
     if line.startswith(">"): 
      # if we get '>' it is time for a new sequence 
      if key and seq: 
       # if it wasn't the first we should print it before overwriting the variables 
       printSeq(key, seq) 
      # store name after '>' and reset sequence 
      key = line[1:].strip() 
      seq = "" 
     else: 
      # accumulate kmer until we hit another '>' 
      seq += line.strip() 
    # when we are done with all the lines, print the last sequence 
    printSeq(key, seq) 
+0

是的,這是有效的。非常感謝! – Gravel

0

我試着用你的榜樣FASTA文件下面,它應該工作:

def count_kmers(seq, k, kmers): 
    for i in range(len(seq) - k + 1): 
     kmr = seq[i:i + k] 

     if kmr in kmers: 
      kmers[kmr] += 1 

     else: 
      kmers[kmr] = 1 

filename = raw_input('File name/path: ') 
k = input('Value for k: ') 
kmers = {} 

# Put each line of the file into a list (avoid empty lines) 
with open(filename) as f: 
    lines = [l.strip() for l in f.readlines() if l.strip() != ''] 

# Find the line indices where a new sequence starts 
idx = [i for (i, l) in enumerate(lines) if l[0] == '>'] 

idx += [len(lines)] 

for i in xrange(len(idx) - 1): 
    start = idx[i] + 1 
    stop = idx[i + 1] 
    sequence = ''.join(lines[start:stop]) 

    count_kmers(sequence, k, kmers) 

print kmers 

希望它能幫助:)