Referencing a list of names using Python

I can't get this little script to work:

genome = open('refT.txt','r')

The data file is a reference genome with a bunch (2 million) of contigs:

Contig_01 
TGCAGGTAAAAAACTGTCACCTGCTGGT 
Contig_02 
TGCAGGTCTTCCCACTTTATGATCCCTTA 
Contig_03 
TGCAGTGTGTCACTGGCCAAGCCCAGCGC 
Contig_04 
TGCAGTGAGCAGACCCCAAAGGGAACCAT 
Contig_05 
TGCAGTAAGGGTAAGATTTGCTTGACCTA

Opening the file:

cont_list = open('dataT.txt','r')

The list of contigs that I want to extract from the data set listed above:

Contig_01 
Contig_02 
Contig_03 
Contig_05

My hopeless script:

for line in cont_list:
    if genome.readline() not in line:
        continue
    else:
        a = genome.readline()
        s = line + a
        data_out = open('output.txt', 'a')
        data_out.write("%s" % s)
        data_out.close()

input('Press ENTER to exit')

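The script successfully writes the first three contigs to the output file, but for some reason it does not seem able to skip "Contig_04", which is not in the list, and move on to "Contig_05".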

I probably look like a lazy bastard for posting this, but I have spent the whole afternoon on this tiny bit of code -_-

Source

2014-02-18 user2406056

The problem is that your 'continue' makes you skip lines in 'cont_list'. You just need to loop over genome until you find 'line' – njzk2

Apart from the skipped lines, are the names guaranteed to appear in the same order in the 'cont_list' and 'genome' files? – user2357112

You can fix it simply by replacing the 'if' with a 'while' – njzk2
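
For illustration, the fix described in these comments could look roughly like the sketch below. It assumes the names appear in the same order in both files (the point raised above) and that the genome file strictly alternates name and sequence lines. `extract_in_order` is a made-up helper name, not something from the question.

```python
def extract_in_order(genome_lines, wanted_names):
    """Return (name, sequence) pairs; assumes both inputs share the same order."""
    genome_iter = iter(genome_lines)
    found = []
    for wanted in wanted_names:
        wanted = wanted.strip()
        if not wanted:
            continue  # ignore blank lines in the list of names
        # Keep reading name/sequence pairs until the wanted name turns up,
        # instead of moving on to the next wanted name after one mismatch.
        for name in genome_iter:
            seq = next(genome_iter, '')
            if name.strip() == wanted:
                found.append((wanted, seq.strip()))
                break
    return found
```

Both arguments can be open file objects, since files iterate line by line; the returned pairs can then be written to the output file in a single pass.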

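I would first create a generator that gives you a tuple of (contig, genome):

def pair(file_obj):
    # Yield (name, sequence) tuples, two lines at a time, without trailing whitespace
    for line in file_obj:
        # the '' default avoids an error if the file has an odd number of lines
        yield line.strip(), next(file_obj, '').strip()

Now, I would use it to get the desired elements: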

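wanted = {'Contig_01', 'Contig_02', 'Contig_03', 'Contig_05'}
with open('filename') as fin:
    pairs = pair(fin)
    while wanted:
        p = next(pairs, None)
        if p is None:
            break  # end of file reached before every name was found
        name = p[0].strip()  # drop any trailing newline before comparing
        if name in wanted:
            # write to output file, store in a list, or dict, ...
            wanted.remove(name)  # sets have remove()/discard(), not forget()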

Source

2014-02-18 15:54:37 mgilson

I would recommend a few things:

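Try using with open(filename, 'r') as f instead of f = open(...)/f.close(). with will handle the closing for you. It also encourages you to handle all of your file IO in one place.
Try reading all of the contigs you want into a list or some other structure. Having several files open at once is a pain. Read all the lines at once and store them.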

Here is some sample code that might do what you are looking for:

from itertools import zip_longest  # this is izip_longest on Python 2

# Read in contigs from file and store them in a set for fast lookups
contigs = set()
with open('dataT.txt', 'r') as contigfile:
    for line in contigfile:
        contigs.add(line.rstrip())  # rstrip() removes '\n' from EOL

# Read through genome file, open up an output file
with open('refT.txt', 'r') as genomefile, open('out.txt', 'w') as outfile:
    # Nifty way to step through the file 2 lines at a time
    for name, seq in zip_longest(*[genomefile]*2, fillvalue=''):
        # compare the contig name to your set of contigs
        if name.rstrip() in contigs:
            outfile.write(name)  # optional. remove if you only want the seq
            outfile.write(seq)
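
In case the `*[genomefile]*2` trick in the code above looks opaque, here is a small sketch of what it does, using an in-memory list of made-up lines instead of a real file. `zip_longest` is the Python 3 name of `izip_longest`.

```python
from itertools import zip_longest

lines = ['Contig_01\n', 'TGCA\n', 'Contig_02\n', 'GGTC\n', 'Contig_03\n']
it = iter(lines)  # an open file object behaves like this iterator

# [it] * 2 is a list holding the SAME iterator twice, so every output tuple
# pulls two consecutive items from it: (line 1, line 2), (line 3, line 4), ...
pairs = list(zip_longest(*[it] * 2, fillvalue=''))

# pairs is now:
# [('Contig_01\n', 'TGCA\n'), ('Contig_02\n', 'GGTC\n'), ('Contig_03\n', '')]
```

The fillvalue only matters when the number of lines is odd, as with the last name here; without it the missing sequence would come back as None.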

Source

2014-02-18 15:56:10 wflynny

Here is a very compact way to get the sequences you want.

def get_sequences(data_file, valid_contigs):
    sequences = []

    with open(data_file) as genome:
        for line in genome:
            # startswith() accepts a tuple and matches any of its entries
            if line.startswith(valid_contigs):
                # next(genome) works on Python 2 and 3; .next() is Python 2 only
                sequence = next(genome, '').strip()
                sequences.append(sequence)

    return sequences

if __name__ == '__main__':
    valid_contigs = ('Contig_01', 'Contig_02', 'Contig_03', 'Contig_05')
    # refT.txt is the genome file from the question (names and sequences)
    sequences = get_sequences('refT.txt', valid_contigs)
    print(sequences)

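This takes advantage of the ability of startswith() to accept a tuple as an argument and check for any match. If the line matches one of the contigs you want, it grabs the next line and appends it to sequences after stripping the unwanted whitespace characters. From there, writing the grabbed sequences to an output file is straightforward.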

Example output:

['TGCAGGTAAAAAACTGTCACCTGCTGGT', 
'TGCAGGTCTTCCCACTTTATGATCCCTTA', 
'TGCAGTGTGTCACTGGCCAAGCCCAGCGC', 
'TGCAGTAAGGGTAAGATTTGCTTGACCTA']
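
One caveat with the startswith() approach described above is that it matches prefixes, so 'Contig_01' would also match a name such as 'Contig_010' if the file contained one. The following is a minimal sketch of a variant that matches names exactly and also does the writing step. `write_sequences` and its parameters are hypothetical names chosen for illustration, and the sketch assumes every name line is followed by its sequence line.

```python
def write_sequences(genome_path, valid_contigs, out_path):
    """Copy the name and sequence of every wanted contig to out_path."""
    wanted = set(valid_contigs)
    written = 0
    with open(genome_path) as genome, open(out_path, 'w') as out:
        for line in genome:
            name = line.strip()
            if name in wanted:  # exact match, unlike startswith()
                sequence = next(genome, '').strip()
                out.write(name + '\n' + sequence + '\n')
                written += 1
    return written
```

With the file names from the question, a call might look like write_sequences('refT.txt', valid_contigs, 'output.txt'); the return value is the number of contigs written.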

Source

2014-02-18 16:02:11
