2009-12-20 68 views
6

我試圖只從NCBI的xml BLAST文件中提取第一個命中。接下來我只想得到第一個HSP。在最後階段,我想根據最佳分數獲得這些信息。 來把事情說清楚這裏的xml文件的樣本:如何從XML NCBI BLAST文件中提取第一個匹配元素?

<?xml version="1.0"?> 
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd"> 
<BlastOutput> 
    <BlastOutput_program>blastx</BlastOutput_program> 
    <BlastOutput_version>blastx 2.2.22 [Sep-27-2009]</BlastOutput_version> 
    <BlastOutput_reference>~Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, ~Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), ~&quot;Gapped BLAST and PSI-BLAST: a new generation of protein database search~programs&quot;, Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference> 
    <BlastOutput_db>/Applications/blast/db/viral1.protein.faa</BlastOutput_db> 
    <BlastOutput_query-ID>lcl|1_0</BlastOutput_query-ID> 
    <BlastOutput_query-def>DSAD-090629_plate11A01a.g1 CHROMAT_FILE: DSAD-090629_plate11A01a.g1 PHD_FILE: DSAD-090629_plate11A01a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A01a DIRECTION: rev</BlastOutput_query-def> 
    <BlastOutput_query-len>1024</BlastOutput_query-len> 
    <BlastOutput_param> 
    <Parameters> 
     <Parameters_matrix>BLOSUM62</Parameters_matrix> 
     <Parameters_expect>1e-05</Parameters_expect> 
     <Parameters_gap-open>11</Parameters_gap-open> 
     <Parameters_gap-extend>1</Parameters_gap-extend> 
     <Parameters_filter>F</Parameters_filter> 
    </Parameters> 
    </BlastOutput_param> 
    <BlastOutput_iterations> 
    <Iteration> 
     <Iteration_iter-num>1</Iteration_iter-num> 
     <Iteration_query-ID>lcl|1_0</Iteration_query-ID> 
     <Iteration_query-def>DSAD-090629_plate11A01a.g1 CHROMAT_FILE: DSAD-090629_plate11A01a.g1 PHD_FILE: DSAD-090629_plate11A01a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A01a DIRECTION: rev</Iteration_query-def> 
     <Iteration_query-len>1024</Iteration_query-len> 
     <Iteration_stat> 
     <Statistics> 
      <Statistics_db-num>68007</Statistics_db-num> 
      <Statistics_db-len>19518578</Statistics_db-len> 
      <Statistics_hsp-len>0</Statistics_hsp-len> 
      <Statistics_eff-space>0</Statistics_eff-space> 
      <Statistics_kappa>0.041</Statistics_kappa> 
      <Statistics_lambda>0.267</Statistics_lambda> 
      <Statistics_entropy>0.14</Statistics_entropy> 
     </Statistics> 
     </Iteration_stat> 
     <Iteration_message>No hits found</Iteration_message> 
    </Iteration> 
    <Iteration> 
<Iteration> 
     <Iteration_iter-num>6</Iteration_iter-num> 
     <Iteration_query-ID>lcl|6_0</Iteration_query-ID> 
     <Iteration_query-def>DSAD-090629_plate11A05a.g1 CHROMAT_FILE: DSAD-090629_plate11A05a.g1 PHD_FILE: DSAD-090629_plate11A05a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A05a DIRECTION: rev</Iteration_query-def> 
     <Iteration_query-len>1068</Iteration_query-len> 
     <Iteration_hits> 
     <Hit> 
      <Hit_num>1</Hit_num> 
      <Hit_id>gnl|BL_ORD_ID|23609</Hit_id> 
      <Hit_def>gi|38707884|ref|NP_945016.1| Putative ribose-phosphate pyrophosphokinase [Enterobacteria phage Felix 01]</Hit_def> 
      <Hit_accession>23609</Hit_accession> 
      <Hit_len>293</Hit_len> 
      <Hit_hsps> 
      <Hsp> 
       <Hsp_num>1</Hsp_num> 
       <Hsp_bit-score>49.2914</Hsp_bit-score> 
       <Hsp_score>116</Hsp_score> 
       <Hsp_evalue>5.15408e-06</Hsp_evalue> 
       <Hsp_query-from>580</Hsp_query-from> 
       <Hsp_query-to>792</Hsp_query-to> 
       <Hsp_hit-from>202</Hsp_hit-from> 
       <Hsp_hit-to>273</Hsp_hit-to> 
       <Hsp_query-frame>-1</Hsp_query-frame> 
       <Hsp_identity>26</Hsp_identity> 
       <Hsp_positive>45</Hsp_positive> 
       <Hsp_gaps>2</Hsp_gaps> 
       <Hsp_align-len>73</Hsp_align-len> 
       <Hsp_qseq>MHIIGDVE--GRTCILVDDMVDTAGTLCHAAKALKERGAAKVYAYCTHPVLSGRAIENIENSVLDELVVTNTI</Hsp_qseq> 
       <Hsp_hseq>MRILDDVDLTDKTVMILDDICDGGRTFVEAAKHLREAGAKRVELYVTHGIFS-KDVENLLDNGIDHIYTTNSL</Hsp_hseq> 
       <Hsp_midline>M I+ DV+ +T +++DD+ D T AAK L+E GA +V Y TH + S + +EN+ ++ +D + TN++</Hsp_midline> 
      </Hsp> 
      </Hit_hsps> 
     </Hit> 
     <Hit> 
      <Hit_num>2</Hit_num> 
      <Hit_id>gnl|BL_ORD_ID|2466</Hit_id> 
      <Hit_def>gi|51557505|ref|YP_068339.1| large tegument protein [Suid herpesvirus 1]</Hit_def> 
      <Hit_accession>2466</Hit_accession> 
      <Hit_len>3084</Hit_len> 
      <Hit_hsps> 
      <Hsp> 
       <Hsp_num>1</Hsp_num> 
       <Hsp_bit-score>48.9062</Hsp_bit-score> 
       <Hsp_score>115</Hsp_score> 
       <Hsp_evalue>6.70494e-06</Hsp_evalue> 
       <Hsp_query-from>369</Hsp_query-from> 
       <Hsp_query-to>875</Hsp_query-to> 
       <Hsp_hit-from>2312</Hsp_hit-from> 
       <Hsp_hit-to>2465</Hsp_hit-to> 
       <Hsp_query-frame>-2</Hsp_query-frame> 
       <Hsp_identity>52</Hsp_identity> 
       <Hsp_positive>70</Hsp_positive> 
       <Hsp_gaps>4</Hsp_gaps> 
       <Hsp_align-len>173</Hsp_align-len> 
      <Hsp_qseq>APESQEPGASTWRSSTSVVKKGQPSQK*CTSSVTSKAVPASWSTTWSTLPAPCATPPKR*KSAAPPRSTPTAPTRCCPAAPSRTSRIPSWTSWWSPTPSRCPLRRSPARVFASSTSPR-SSPKRSAASATKNRSAP---CSAKRNWPDHTAPPRAGLFALPPEAGRKPQGGLV</Hsp_qseq> 
      <Hsp_hseq>APPAQKPPAQPATAAATTAPKATPQTQPPTRAQTQTAPPPPSAAT-----AAAQVPPQ------PPSSQPAAKPRGAPPAPPAPP--PPSAQTTLPRPAAPPAPPPPS---AQTTLPRPAPPPPSAPAATPTPPAPGPAPSAKKSDGDRIVEPKAG---APPDVRDAKFGGKV</Hsp_hseq> 
      <Hsp_midline>AP +Q+P A ++ + K P + T + T A P + T  A PP+  PP S P A R P AP  P  P P+ P P+ A +T PR + P SA +AT AP SAK++ D P+AG PP+  GG V</Hsp_midline> 
     </Hsp> 
     </Hit_hsps> 
    </Hit> 
    </Iteration_hits> 
    <Iteration_stat> 
    <Statistics> 
     <Statistics_db-num>68007</Statistics_db-num> 
     <Statistics_db-len>19518578</Statistics_db-len> 
     <Statistics_hsp-len>0</Statistics_hsp-len> 
     <Statistics_eff-space>0</Statistics_eff-space> 
     <Statistics_kappa>0.041</Statistics_kappa> 
     <Statistics_lambda>0.267</Statistics_lambda> 
     <Statistics_entropy>0.14</Statistics_entropy> 
    </Statistics> 
    </Iteration_stat> 
</Iteration> 

基本上每個查詢搜索創建一個迭代元素。每個迭代可以有多個命中,而多個命中又可以有多個HSP。我只想得到第一個命中,它是每次迭代中的第一個HSP。如果BLAST發現沒有命中,我想忽略迭代。 我工作起來這個簡單的代碼:

#!/usr/bin/env python 
from elementtree.ElementTree import parse 
from elementtree import ElementTree as ET 
file = open("/Applications/blast/blanes_viral_nr_results.xml", "r") 
save_file = open("/Applications/blast/Blast_parse_ET.txt", 'w') 
tree = parse(file) 
elem = tree.getroot() 
print elem 
Per_ID =() 

save_file.write('>%s\t%s\t%s\t%s\t%s\t%s\t\n\n\n\n' % ("It_Num\t", "It_ID\t", "Hit_Def\t", "Num\t", "ID\t", "ACC\t")) 
iteration = tree.findall('BlastOutput_iterations/Iteration') 
for iteration in iteration: 
    for hit in iteration.findall('Iteration_hits/Hit'): 
    It_Num = iteration.findtext('Iteration_iter-num') 
    It_ID = iteration.findtext('Iteration_query-def') 
    Hit_Def = hit.findtext('Hit_def') 
    Num = hit.findtext('Hit_num') 
    ID = hit.findtext('Hit_id') 
    DEF = hit.findtext('Hit_def') 
    ACC = hit.findtext('Hit_accession') 
    save_file.write('>%s\t%s\t%s\t%s\t%s\t%s\t' % (It_Num, It_ID[12:26], Hit_Def[1:10], Num, ID, ACC,)) 
    for hsp in hit.findall('Hit_hsps'): 
     HSPN = hsp.findtext('Hsp/Hsp_num') 
     identities = hsp.findtext('Hsp/Hsp_identity') 
     #print 'id: ', identities.rjust(4), 
     length = hsp.findtext('Hsp/Hsp_align-len') 
     #print 'len:', length.rjust(4), 
     Per_ID = int(identities) * 100.0/int(length) 
     #print hsp.findtext('Hsp/Hsp_qseq')[:50] 
     #print hsp.findtext('Hsp/Hsp_midline')[:50] 
     #print hsp.findtext('Hsp/Hsp_hseq')[:50] 
     save_file.write('%s\t%s\t%s\%st\n' % ('***', '%', HSPN, Per_ID)) 
    save_file.write('n\n' %()) 

任何幫助將大大appriciated!

回答

9

雖然構建您自己的解析器可能「很有趣」,但已經有一個可以解析BLAST xml文件的包......它甚至可以爲您執行本地BLAST實例的中間調用,如果您願意的話。

主要場地是在這裏: http://biopython.org/wiki/Biopython

和XML解析器BLAST是在這裏: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc82

喜歡的東西:

from Bio.Blast import NCBIXML 
with open('xml/results/file') as handle: 
    all_records = NCBIXML.parse(handle) 
    first_record = all_records.next() 

應該工作。我通常喜歡BioPython解析器和編寫器,但我不喜歡類結構組織。所以我通常只使用解析器並將我需要的信息提取到我自己的結構中。 YMMV

希望有所幫助。

+0

感謝您的答覆。 我以前嘗試過類似的東西,現在我試過了你的建議,但是當我嘗試打電話給一個特定的項目時,我一直得到以下按摩: first_record = blast_records [0] TypeError:'generator'object是unsubscriptable 謝謝, – 2009-12-21 08:33:10

+1

我改變了答案...我有一段時間沒有使用biopython,我忘了他們重構的東西使用發電機。試試這個新的。 – JudoWill 2009-12-21 18:53:02

6

我會繼續介紹JudoWill的建議 - 更聰明地工作,而不是用BioPython解析器。 這應該讓你遠一點:

from Bio.Blast import NCBIXML 
blast = NCBIXML.parse(open('results.xml','rU')) 
for record in blast: 
    if record.alignments: 
     # to print the "best" matches e-score 
     print record.alignments[0].hsps[0].expect 
     # to print the "best" matches bit-score 
     print record.alignments[0].hsps[0].score 
     break 

這將第一查詢(返回的第一個和最好的比賽)後停止。由於我懷疑您可能想要在同一個文件中查詢其他查詢的結果,只需從最後一行中刪除break即可。

1

如果我理解了你的基本需求,那麼你想獲得查詢蛋白質/核苷酸序列的頂部命中/ HSP。爲什麼不在你的系統上安裝具有格式化nr/nt數據庫的獨立blast。 類型的選項

blastall -p {blast programme blastp for protein,blastn for nucleotide} -d {database} -i {input query} -v 1{for top hit} -b 1{alignment of the top hit with query} -m 7{xml blast output} -o example.xml 

打開MS Excel中的XML輸出文件,你可以看到頂級單命中blastoutput的表格的形式給每個查詢序列

相關問題