2011-12-12 86 views
1

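Parsing results with Python

I am a Python beginner (I am a biologist). I have a file containing the results of a particular piece of software, and I want to parse those results with Python. From the output below, I only want the score, and I want to split each sequence into individual amino acids.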

no. score sequence

1 0.273778 FFHH-YYFLHRRRKKCCNNN-CCCK---HQQ---HHKKHV-FGGGE-EDDEDEEEEEEEE-EE-- 
2 0.394647 IIVVIVVVVIVVVVVVVVVV-CCCVA-IVVI--LIIIIIIIIYYYA-AVVVVVVVAAAAV-AST- 
3 0.456667  FIVVIVVVVIXXXXIGGGGT-CCCCAV -------------IVBBB-AAAAAA--------AAAA- 
4 0.407581 MMLMILLLLMVVAIILLIII-LLLIVLLAVVVVVAAAVAAVAIIII-ILIIIIIILVIMKKMLA- 
5 0.331761 AANSRQSNAAQRRQCSNNNR-RALERGGMFFRRKQNNQKQKKHHHY-FYFYYSNNWWFFFFFFR- 
6 0.452381 EEEEDEEEEEEEEEEEEEEE-EEEEESSTSTTTAEEEEEEEEEEEE-EEEEEEEEEEEEEEEEE- 
7 0.460385 LLLLLLLLMMIIILLLIIII-IIILLVILMMEEFLLLLILIVLLLM-LLLLLLLLLLVILLLVL- 
8 0.438680 ILILLVVVVILVVVLQLLMM-QKQLIVVLLVIIMLLLLMLLSIIIS-SMMMILFFLLILIIVVL- 
9 0.393291 QQQDEEEQAAEEEDEKGSSD-QQEQDDQDEEAAAHQLESSATVVQR-QQQQQVVYTHSTVTTTE- 

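From the table above, I want to keep the same number and score but list the sequence separately, one column at a time, so it should look like this: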

no.  score   amino acid(1st column) 

1  0.273778   F 

2  0.394647   I 

3  0.456667   F 

Another table represents the second column of amino acids:

no  score  amino acid (2nd column) 

1  0.273778   F 

2  0.394647   I 

3  0.456667   I 

A third table represents the third column of amino acids, a fourth table the fourth column, and so on.

Thanks in advance for your help.

+3

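What do 'F', 'I' and 'F' stand for? Are these the first characters of the strings above? Why is it 'f' in the third row and not 'F'? We are not Python beginners, but we are not biologists either. We can help you with Python, but you have to explain what the individual amino acids are here. – eumiro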

+0

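It should be F ... I have edited the question. (F, I, F) are amino acid codes, and this is the result of an alignment. I would like to split the whole sequence column by column, together with the score and the sequence number. – hari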

+0

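Your description of how to get to the letters is still not entirely clear. It might be best to add some example sequences and show how to obtain the desired result from them. – hochl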

Answers

0

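From your example, I assume that:

  • You want to save each table to a separate results file.
  • Each sequence is 65 characters long.
  • Some sequences contain meaningless whitespace that has to be removed (line 3 in your example).

Here is my code example. It reads data from input.dat and writes the results to result-column-<number>.dat: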

import re 
import sys 

# I will write each table to a different results file. 
# dictionary to map columns (numbers) to opened file objects: 
resultfiles = {} 


def get_result_file(column): 
    # helper to easily access a results file: open it on first use, then reuse it. 
    if column not in resultfiles: 
        resultfiles[column] = open('result-column-%d.dat' % column, 'w') 
    return resultfiles[column] 


# iterate over data: 
for line in open('input.dat'): 
    try: 
        # str.split(separator, maxsplit) 
        # with maxsplit=2 everything after the score stays in `seq`, 
        # even if the sequence itself contains whitespace: 
        no, score, seq = line.split(None, 2) 

        # from your example I guess that whitespace in a sequence is meaningless, 
        # however in your example one sequence contains whitespace, so I remove it: 
        seq = re.sub(r'\s+', '', seq) 

        # data validation will help to spot problems early: 
        int(no)       # raises ValueError if `no` is not an integer 
        float(score)  # raises ValueError if `score` is not a number 
        assert len(seq) == 65, seq 

    except Exception as e: 
        # print the error and continue to process data: 
        print('Error %s in line: %s.' % (e, line), file=sys.stderr) 
        continue # jump to next iteration of for loop. 

    # int() and float() raise ValueError if no or score aren't numbers; 
    # assert <condition> raises AssertionError if the condition is False. 

    # iterate over each character in the amino acid sequence: 
    for column, char in enumerate(seq, 1): 
        f = get_result_file(column) 
        f.write('%s %s %s\n' % (no, score, char)) 


# close all opened result files: 
for f in resultfiles.values(): 
    f.close() 

Notable functions used in this example: str.split, re.sub and enumerate.
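To see what these calls do, here is a small sketch that runs them on a single line. The sample line is a shortened, made-up variant of line 3 from the question (it keeps the stray space inside the sequence), so the length check against 65 is left out.

```python
import re

# Shortened stand-in for line 3 of the question; note the space inside the sequence.
line = "3 0.456667  FIVV-CCCCAV ---IVBBB- \n"

# maxsplit=2: split on whitespace at most twice, so the remainder of the line
# (including any inner spaces) ends up in `seq`.
no, score, seq = line.split(None, 2)

# Remove every run of whitespace, including the trailing newline.
seq = re.sub(r"\s+", "", seq)

# enumerate(seq, 1) numbers the residues starting from column 1.
first_three = list(enumerate(seq, 1))[:3]

print(no, score, seq)
print(first_three)
```

Without maxsplit, the inner space would produce a fourth field and the three-way unpacking would fail with a ValueError.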

+0

Thanks for your help. I got an error at line 26, `assert int(no), no`: ValueError: invalid literal for int() with base 10: '#column' – hari

+0

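Can you find the line in your data file that contains the text '#column'? Could you show me that line by editing your question? Judging from the data sample you provided, this error cannot occur. – Ski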

+0

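The data I provided is just an example. I also want to keep the "-" characters in my data, since they mean something. I wonder whether I could upload my whole results file, as that might help. Sorry for not being able to explain this properly. – hari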
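The error reported in these comments suggests that the real file contains header lines such as '#column'. The following is a minimal sketch of one way to handle that, assuming such lines always start with '#'; the `parse_lines` helper and the sample header text are hypothetical, not part of the original answer.

```python
def parse_lines(lines):
    """Yield (no, score, seq) tuples, skipping blank lines and '#' header lines."""
    for line in lines:
        stripped = line.strip()
        # Assumption: header/comment lines start with '#', e.g. '#column ...'.
        if not stripped or stripped.startswith("#"):
            continue
        no, score, seq = stripped.split(None, 2)
        # Drop stray whitespace only; '-' gap characters are kept as data.
        seq = "".join(seq.split())
        yield int(no), float(score), seq


sample = [
    "#column score sequence\n",  # hypothetical header line
    "\n",
    "1 0.273778 FFHH-YY\n",
    "3 0.456667  FIVV IVV\n",
]
records = list(parse_lines(sample))
print(records)
```

Because only whitespace is removed, every '-' stays at its original column position in the alignment.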

5

Assuming you have already opened the file containing the data as f, your example can be reproduced with:

for ln in f: # loop over all lines 
    # maxsplit=2 keeps a sequence that contains spaces in one piece 
    seqno, score, seq = ln.split(None, 2) 
    print("%s %s %s" % (seqno, score, seq[0])) 

To split up the sequence, you additionally need to loop over all the letters of seq:

for ln in f: 
    seqno, score, seq = ln.split(None, 2) 
    for x in seq.strip(): # one output line per residue 
        print("%s %s %s" % (seqno, score, x)) 

This will print the sequence number and score many times. I'm not sure this is what you want.
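If the goal is one table per column rather than one long listing, the residues can be grouped by column position first. This is only a sketch under that assumption: the `column_tables` helper is hypothetical, and the rows are the first four residues of the first three sequences in the question.

```python
rows = [
    ("1", "0.273778", "FFHH"),
    ("2", "0.394647", "IIVV"),
    ("3", "0.456667", "FIVV"),
]


def column_tables(rows):
    """Return {column_number: [(no, score, residue), ...]}."""
    tables = {}
    for no, score, seq in rows:
        for column, residue in enumerate(seq, 1):
            tables.setdefault(column, []).append((no, score, residue))
    return tables


tables = column_tables(rows)
for column in sorted(tables):
    print("no. score amino acid (column %d)" % column)
    for no, score, residue in tables[column]:
        print(no, score, residue)
```

Each sequence number and score is then printed once per table instead of once per residue in a single stream.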

+1

If you plan to do anything further with the sequences, I suggest converting them to Biopython (www.biopython.org) Sequence objects. – 2011-12-12 12:03:13

+0

Thanks for the suggestion. I just want to split the sequence; I have edited the question accordingly. – hari

0

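I don't think it is useful to create tables.
Just put the data in a suitable structure and use a function to display what you need at the moment you need it: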

with open('bio.txt') as f: 
    data = [line.rstrip().split(None, 2) for line in f if line.strip()] 


def display(data, nth, pat='%-6s %-15s %s', uz=('th', 'st', 'nd', 'rd')): 
    # print the header, then one row per sequence with its nth residue 
    print(pat % ('no.', 'score', 
                 'amino acid(%d%s column)' % (nth, uz[0 if nth // 4 else nth]))) 
    print('\n'.join(pat % (a, b, c[nth - 1]) for a, b, c in data)) 

display(data, 1) 
print() 
display(data, 3) 
print() 
display(data, 7) 

Result

no.  score   amino acid(1st column) 
1  0.273778   F 
2  0.394647   I 
3  0.456667   F 
4  0.407581   M 
5  0.331761   A 
6  0.452381   E 
7  0.460385   L 
8  0.438680   I 
9  0.393291   Q 

no.  score   amino acid(3rd column) 
1  0.273778   H 
2  0.394647   V 
3  0.456667   V 
4  0.407581   L 
5  0.331761   N 
6  0.452381   E 
7  0.460385   L 
8  0.438680   I 
9  0.393291   Q 

no.  score   amino acid(7th column) 
1  0.273778   Y 
2  0.394647   V 
3  0.456667   V 
4  0.407581   L 
5  0.331761   S 
6  0.452381   E 
7  0.460385   L 
8  0.438680   V 
9  0.393291   E 
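One detail of the display function above: its suffix lookup labels every column from 4 upward with "th", so column 21 would be shown as "21th". A possible replacement is sketched below; the `ordinal` helper is hypothetical and not part of the original answer.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11, 12 and 13 (and 111, 112, ...) take 'th' even though they end in 1, 2, 3.
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "%d%s" % (n, suffix)


for n in (1, 2, 3, 4, 11, 21, 65):
    print("amino acid (%s column)" % ordinal(n))
```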

0

Here is a simple working solution:

# Open the file "db.txt"; give the full path if it is not in the same directory as the Python file. 
# You can use any extension for the file; 'r' opens it in reading mode. 
filehandler = open("db.txt", 'r') 
# Read all the lines at once into a list; every line is a list member. 
# Another way: you can read the file line by line. 
LinesList = filehandler.readlines() 
filehandler.close() 
# Create empty lists to store your results. 
no = [] 
Score = [] 
AminoAcids = [] # list of lists: index 0 holds the characters of the first line, and so on 
# Process each line, assuming constant spacing in the input file: 
# no is the first char, the score runs from char 4 to 12 and the amino acids from 16 to the end. 
# Adjust these offsets if your file is laid out differently. 
for Line in LinesList: 
    # add the no 
    no.append(Line[0]) 
    # add the score 
    Score.append(Line[4:12]) 
    Aminolist = list(Line[16:]) # break the amino acid string up so each character is a list element 
    # add Aminolist to the AminoAcids matrix (list of lists) 
    AminoAcids.append(Aminolist) 

# You can now play with the data! 
# Print the tables; you could also write them to a file instead. 
for k in range(0, 65): 
    print("Table %d" % (k + 1)) # add 1 so the tables are not zero-indexed 
    print("no. Score  amino acid(column %d)" % (k + 1)) 
    for i in range(len(no)): 
        print("%s %s %s" % (no[i], Score[i], AminoAcids[i][k])) 

Here is part of the result as it appears on the console:

Table 1 
no. Score  amino acid(column 1) 
1 0.273778 F 
2 0.394647 I 
3 0.456667 F 
4 0.407581 M 
5 0.331761 A 
6 0.452381 E 
7 0.460385 L 
8 0.438680 I 
9 0.393291 Q 
Table 2 
no. Score  amino acid(column 2) 
1 0.273778 F 
2 0.394647 I 
3 0.456667 I 
4 0.407581 M 
5 0.331761 A 
6 0.452381 E 
7 0.460385 L 
8 0.438680 L 
9 0.393291 Q 
Table 3 
no. Score  amino acid(column 3) 
1 0.273778 H 
2 0.394647 V 
3 0.456667 V 
4 0.407581 L 
5 0.331761 N 
6 0.452381 E 
7 0.460385 L 
8 0.438680 I 
9 0.393291 Q 
Table 4 
no. Score  amino acid(column 4) 
1 0.273778 H 
2 0.394647 V 
3 0.456667 V 
4 0.407581 M 
5 0.331761 S 
6 0.452381 E 
7 0.460385 L 
8 0.438680 L 
9 0.393291 D 
Table 5 
no. Score  amino acid(column 5) 
1 0.273778 - 
2 0.394647 I 
3 0.456667 I 
4 0.407581 I 
5 0.331761 R 
6 0.452381 D 
7 0.460385 L 
8 0.438680 L 
9 0.393291 E 
Table 6 
no. Score  amino acid(column 6) 
1 0.273778 Y 
2 0.394647 V 
3 0.456667 V 
4 0.407581 L 
5 0.331761 Q 
6 0.452381 E 
7 0.460385 L 
8 0.438680 V 
9 0.393291 E 
Table 7 
no. Score  amino acid(column 7) 
1 0.273778 Y 
2 0.394647 V 
3 0.456667 V 
4 0.407581 L 
5 0.331761 S 
6 0.452381 E 
7 0.460385 L 
8 0.438680 V 
9 0.393291 E
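Fixed character offsets like the ones in this answer only work while every field keeps its width. The sketch below contrasts that with whitespace splitting. Both helper functions are hypothetical, the offsets in `parse_fixed` assume the single-space layout shown on this page (not the offsets used in the answer), and the row numbered 10 is made up for illustration.

```python
def parse_fixed(line):
    # Fixed-width slicing: assumes a one-digit number and single spaces.
    return line[0], line[2:10], line[11:].strip()


def parse_split(line):
    # Whitespace splitting does not depend on character offsets.
    no, score, seq = line.split(None, 2)
    return no, score, "".join(seq.split())


row_1 = "1 0.273778 FFHH\n"
row_10 = "10 0.123456 ABC\n"  # hypothetical tenth row

print(parse_fixed(row_1), parse_split(row_1))
print(parse_fixed(row_10), parse_split(row_10))
```

For the one-digit row both functions agree, while for the two-digit row the fixed offsets cut the number and the score in the wrong places.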