2014-02-16
4

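Concatenating sliced strings based on slice indices in a csv file

My challenge looks simple, but I have run out of options, so any help would be much appreciated.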

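I have many DNA sequences in FASTA format that need to be sliced at specific positions, with the resulting pieces then concatenated. So if my sequence file looks like this: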

~$ cat seq_file 
>Sequence1 
This is now a sequence that must require a bit of slicing and concatenation to be useful 
>Sequence2 
I have many more uncleaned strings like this in the form of sequences 

the output I want is:

>Sequence1 
This is useful 
>Sequence2 
I have cleaned sequences 

The slice boundaries come from a separate csv file of slice indices. In this case the slice positions are organized like this:

~$ cat test.csv 
Sequence1,0,9,66,74,, 
Sequence2,0,5,15,22,48,57 
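
The row layout above (an ID followed by alternating start and end positions, padded with empty cells) can be turned into index pairs before any slicing happens. The following is a minimal Python 3 sketch of that step, not part of the original post; the function name `parse_slice_rows` and the in-memory sample text are assumptions made for illustration.

```python
import csv
import io


def parse_slice_rows(csv_text):
    """Map each sequence ID to a list of (start, end) index pairs."""
    slices_by_id = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        seq_id = row[0].strip()
        # Drop the empty cells that pad rows with fewer slices.
        numbers = [int(cell) for cell in row[1:] if cell.strip()]
        # Even positions are starts, odd positions are ends.
        slices_by_id[seq_id] = list(zip(numbers[0::2], numbers[1::2]))
    return slices_by_id


# Sample text mirroring test.csv from the question.
sample = "Sequence1,0,9,66,74,,\nSequence2,0,5,15,22,48,57\n"
slice_table = parse_slice_rows(sample)
print(slice_table)
```

Filtering out the empty cells up front means no `ValueError` handling is needed later when the indices are converted to integers.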

My code:

from Bio import SeqIO
import csv

# Map each FASTA description line to its sequence.
seq_dict = {}
for seq_record in SeqIO.parse('seq_file', 'fasta'):
    descr = seq_record.description
    seq_dict[descr] = seq_record.seq

with open('test.csv', 'rb') as file:
    reader = csv.reader(file)
    for row in reader:
        seq_id = row[0]
        for n in range(1, 7):
            if n % 2 != 0:
                start = row[n]  # start positions occupy the odd-numbered columns
            else:
                end = row[n]  # end positions occupy the even-numbered columns

                for key, value in sorted(seq_dict.iteritems()):
                    #print key, value
                    if key == seq_id:  # cross check matching sequence identities
                        try:
                            slice_seq = value[int(start):int(end)]
                            print key
                            print slice_seq
                        except ValueError:
                            print 'Ignore empty slice indices.. '

This now prints:

Sequence1 
Thisisnow 
Sequence1 
useful 
Ignore empty slice indices.. 
Sequence2 
Ihave 
Sequence2 
cleaned 
Sequence2 
sequences 

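So far so good, this is what I expected. But how can I join the sliced pieces together, by concatenation or any other operation available in Python, to get the output I want? Thanks.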

Answers

2

You can do this with a couple of modifications:

with open('test.csv', 'rb') as file:
    reader = csv.reader(file)
    for row in reader:
        seq_id = row[0]
        seqs = []
        for n in range(1, 7):
            if n % 2 != 0:
                start = row[n]  # start positions occupy the odd-numbered columns
            else:
                end = row[n]

                for key, value in sorted(seq_dict.iteritems()):
                    #print key, value
                    if key == seq_id:  # cross check matching sequence identities
                        try:
                            seqs.append(value[int(start):int(end)])
                        except ValueError:
                            print 'Ignore empty slice indices.. '
        print ' '.join(str(x) for x in seqs)
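
Since `seq_dict` is keyed by sequence ID, the inner loop over every dictionary entry can be replaced by a direct lookup. Below is a minimal Python 3 sketch of that idea, offered as an illustration rather than as the answerer's code. Plain strings stand in for Biopython `Seq` objects, the helper name `join_slices` is hypothetical, and whitespace is stripped from the sample sequences to mirror the whitespace-free output shown in the question.

```python
def join_slices(sequence, index_pairs, sep=" "):
    """Cut each (start, end) piece out of sequence and join the pieces."""
    # str() keeps this usable with objects whose slices are not plain strings.
    return sep.join(str(sequence[start:end]) for start, end in index_pairs)


# Sample data copied from the question; strings stand in for Seq objects.
raw = {
    "Sequence1": "This is now a sequence that must require a bit of slicing "
                 "and concatenation to be useful",
    "Sequence2": "I have many more uncleaned strings like this in the form "
                 "of sequences",
}
# Remove all whitespace so the indices line up with the question's output.
seq_dict = {name: "".join(text.split()) for name, text in raw.items()}

slice_table = {
    "Sequence1": [(0, 9), (66, 74)],
    "Sequence2": [(0, 5), (15, 22), (48, 57)],
}

results = {}
for seq_id, pairs in slice_table.items():
    # Direct lookup by ID instead of scanning every dictionary entry.
    results[seq_id] = join_slices(seq_dict[seq_id], pairs)
    print(">" + seq_id)
    print(results[seq_id])
```

Passing `sep=""` joins the pieces with no separator, which is probably what real DNA sequences would need.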

I used 'str(x)' to create strings from the 'Seq' objects, because 'join' does not work with them. –


Thank you very much. Really!! – user3014974

3

Something like this:

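import csv
from string import whitespace

with open('seq_file') as f1, open('test.csv') as f2:
    for row in csv.reader(f2):
        # Drop empty cells and convert the remaining indices to integers.
        it = iter(map(int, filter(None, row[1:])))
        # Pair consecutive values into slice(start, end) objects.
        slices = [slice(x, next(it)) for x in it]
        seq = next(f1)  # header line, e.g. '>Sequence1'
        line = next(f1).translate(None, whitespace)  # sequence line, whitespace removed
        print seq,
        print ' '.join(line[s] for s in slices)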

Output:

>Sequence1 
Thisisnow useful 
>Sequence2 
Ihave cleaned sequences 
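
The code above targets Python 2: `print` is a statement, and `str.translate(None, whitespace)` is not available on Python 3 strings. A rough Python 3 sketch of the same lockstep approach follows. It keeps the same assumptions as the answer (each sequence sits on a single line, and the FASTA records appear in the same order as the csv rows). The function name `sliced_records` is hypothetical, and `io.StringIO` objects stand in for the two files.

```python
import csv
import io


def sliced_records(fasta_lines, csv_lines):
    """Yield (header, joined_pieces) for each FASTA record and csv row pair."""
    fasta_iter = iter(fasta_lines)
    for row in csv.reader(csv_lines):
        if not row:
            continue  # skip blank csv lines
        numbers = (int(cell) for cell in row[1:] if cell.strip())
        # zip over the same iterator twice pairs values as (start, end).
        slices = [slice(start, end) for start, end in zip(numbers, numbers)]
        header = next(fasta_iter).strip()
        # split() + join removes all whitespace from the sequence line.
        body = "".join(next(fasta_iter).split())
        yield header, " ".join(body[s] for s in slices)


# In-memory stand-ins for seq_file and test.csv from the question.
fasta = io.StringIO(
    ">Sequence1\n"
    "This is now a sequence that must require a bit of slicing "
    "and concatenation to be useful\n"
    ">Sequence2\n"
    "I have many more uncleaned strings like this in the form of sequences\n"
)
table = io.StringIO("Sequence1,0,9,66,74,,\nSequence2,0,5,15,22,48,57\n")

output = list(sliced_records(fasta, table))
for header, pieces in output:
    print(header)
    print(pieces)
```

With real files, the two `io.StringIO` objects would be replaced by handles from `open('seq_file')` and `open('test.csv', newline='')`.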

Great. This place is awesome. – user3014974