2016-02-21 61 views
0

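Python: How do I encode DNA sequences using binary values?

I want to convert a file containing several DNA sequences into binary values, using the following mapping: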

A=1000 
C=0100 
G=0010 
T=0001 

FileA.txt

CCGAT 
GCTTA 

Desired output

01000100001010000001 
00100100000100011000 

I have tried to solve this with the code below, but the .bin output file does not seem to contain what I want. Can anyone help me?

Code

import sys 

if len(sys.argv) != 2 : 
    sys.stderr.write('Usage: {} <nucleotide file>\n'.format(sys.argv[0])) 
    sys.exit() 

# assumes the file only contains dna and newlines 
sequence = '' 
for line in open(sys.argv[1]) : 
    sequence += line.strip().upper() 

sequence = sequence.replace('A', chr(0b1000)) 
sequence = sequence.replace('C', chr(0b0100)) 
sequence = sequence.replace('G', chr(0b0010)) 
sequence = sequence.replace('T', chr(0b0001)) 

outfile = open(sys.argv[1] + '.bin', 'wb') 

outfile.write(bytearray(sequence, encoding = 'utf-8')) 
+0

Do you actually want a binary file, or do you want a file containing the string representations '1000', '0100', ...? – wwii

+1

You could cut the size of your encoded string in half with `A="00"; C="01"; G="10"; T="11"` – PaulMcG

Answers

1

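Do you want ASCII output or binary? The code below gives you what you showed in your post (although on a single line; the code needs to be modified to preserve the line breaks).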

import sys 

if len(sys.argv) != 2 : 
    sys.stderr.write('Usage: {} <nucleotide file>\n'.format(sys.argv[0])) 
    sys.exit() 

# assumes the file only contains dna and newlines 
sequence = '' 
for line in open(sys.argv[1]) : 
    sequence += line.strip().upper() 

sequence = sequence.replace('A', '1000') 
sequence = sequence.replace('C', '0100') 
sequence = sequence.replace('G', '0010') 
sequence = sequence.replace('T', '0001') 

# the output is text ('0' and '1' characters), so open in text mode
with open(sys.argv[1] + '.bin', 'w') as outfile:
    outfile.write(sequence)
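
To keep the line breaks, each line can be encoded on its own instead of concatenating everything first. The following is only a minimal sketch, assuming the input contains nothing but A, C, G, T and newlines; the names `CODES`, `encode_line` and `encode_lines` are illustrative and not part of the answer above.

```python
# Mapping taken from the question; the names below are hypothetical.
CODES = {'A': '1000', 'C': '0100', 'G': '0010', 'T': '0001'}

def encode_line(line):
    """Return the '0'/'1' text encoding of one DNA line."""
    # strip() drops the newline and stray spaces; upper() accepts lowercase input
    return ''.join(CODES[base] for base in line.strip().upper())

def encode_lines(lines):
    """Encode an iterable of lines, one output string per non-empty input line."""
    return [encode_line(line) for line in lines if line.strip()]

if __name__ == '__main__':
    for encoded in encode_lines(['CCGAT \n', 'GCTTA \n']):
        print(encoded)
```

Because an open file is itself an iterable of lines, `encode_lines(open(path))` should work the same way, and writing `'\n'.join(...)` of the result would give one encoded sequence per line.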

EDIT: This creates a binary file in which each nucleotide takes one byte, and the newlines are stored in binary form as well.

import sys 

if len(sys.argv) != 2 : 
    sys.stderr.write('Usage: {} <nucleotide file>\n'.format(sys.argv[0])) 
    sys.exit() 

# assumes the file only contains dna and newlines
newbytearray = bytearray()
# named 'codes' so the built-in dict is not shadowed
codes = {'A': 0b1000, 'C': 0b0100, 'G': 0b0010, 'T': 0b0001, '\n': 0b1010}
with open(sys.argv[1]) as infile:
    while True:
        char = infile.read(1)
        if not char:
            break  # end of file; the with block closes the file
        newbytearray.append(codes[char])
with open(sys.argv[1] + '.bin', 'wb') as outfile:
    outfile.write(newbytearray)

# Converts the binary file back to text and prints the resulting sequence.
testBin = open('fileA.txt.bin', 'rb')
sequence = ''
for line in testBin:
    line = line.decode('latin-1')  # bytes -> text so str.replace also works on Python 3
    line = line.replace(chr(0b1000), '1000')
    line = line.replace(chr(0b0100), '0100')
    line = line.replace(chr(0b0010), '0010')
    line = line.replace(chr(0b0001), '0001')
    line = line.replace(chr(0b1010), '\n')  # 0b1010 is already the newline byte
    sequence += line
#outputVerify = open('outputVerify.txt','w')
#outputVerify.write(sequence)
#outputVerify.close()
print(sequence)
testBin.close()

# Shows the data of the binary file. Note that byte 6 is the newline character 0b1010.
testBin = open('fileA.txt.bin', 'rb')
data = b''
i = 0
while True:
    b = testBin.read(1)
    i += 1
    if not b:
        break  # due to eof
    data += b
    print('byte: ' + str(i) + ' is ' + '{0:04b}'.format(ord(b)) + ' and has decimal representation: ' + str(ord(b)))
testBin.close()
+0

Binary output with the newlines preserved is preferred. Thanks. – Xiong89

+0

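Hi Xiong. If this solves your problem, please don't forget to accept my answer. Otherwise, let me know what is wrong or whether you have any questions. Thanks. – weezilla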

+0

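If this is not for an assignment or for interfacing with someone else's software, I suggest encoding the nucleotides as 0b00, 0b01, 0b10 and 0b11 to save time and space. You can still use the 4-bit newline 0b1010 to separate the nucleotide sequences. – weezilla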
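
The 2-bit idea from the comments can be sketched roughly as follows, packing four nucleotides into each byte. This is an illustration under assumptions, not code from the answer: the names `TWO_BIT`, `BASES`, `pack` and `unpack` are hypothetical, and instead of a 4-bit separator the sketch assumes the sequence length is stored elsewhere, because padding bits in the last byte cannot be told apart from `A` (0b00).

```python
# Hypothetical 2-bit codes following the comment above; not part of the original answer.
TWO_BIT = {'A': 0b00, 'C': 0b01, 'G': 0b10, 'T': 0b11}
BASES = 'ACGT'  # index i is the base whose 2-bit code is i

def pack(seq):
    """Pack a DNA string into bytes, four bases per byte, first base in the high bits."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | TWO_BIT[base]
        # left-align a short final chunk so unpack() can always read from the high bits
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)

def unpack(data, length):
    """Recover `length` bases from bytes produced by pack()."""
    bases = []
    for byte in bytearray(data):
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    # drop the padding that pack() added to the last byte
    return ''.join(bases[:length])

if __name__ == '__main__':
    packed = pack('CCGAT')
    print(len(packed), unpack(packed, 5))
```

With this layout a 5-base sequence takes 2 bytes rather than the 5 bytes used by the one-byte-per-nucleotide file.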

3
import re 

d = {'A' :'1000','C' : '0100','G':'0010','T': '0001'} 

patterns = ['CCGAT' ,'GCTTA'] 

for p in patterns: 
    for c in p: 
     p = re.sub(c,d[c],p) 
    print(p) 
+0

Instead of using `re`, why not just use the mapping directly: `[d[char] for pattern in patterns for char in pattern]` – wwii

+0

I wanted to do the replacement directly at the right position in the pattern p. That seemed simpler. – yael
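
For comparison, the direct-mapping approach suggested in the comment could look like the sketch below. It reuses the names `d` and `patterns` from the answer; `encoded` is an illustrative name added here.

```python
d = {'A': '1000', 'C': '0100', 'G': '0010', 'T': '0001'}
patterns = ['CCGAT', 'GCTTA']

# Look each character up once and join the pieces; no regular expressions needed.
encoded = [''.join(d[char] for char in pattern) for pattern in patterns]

for e in encoded:
    print(e)
```

One reason to prefer the lookup is that it does not depend on the replacement text: the `re.sub` loop works here because the codes contain only `0` and `1`, so an earlier substitution can never be matched again by a later one.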
