2013-11-25

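BCBio GFF parser does not parse correctly

I am experimenting with BCBio's GFF parser, hoping I can use it in my tool. I took a test .gbk file from NCBI's RefSeq database and used the parser to convert it to a .gff file.

The code I am using (from http://biopython.org/wiki/GFF_Parsing):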

#!/usr/bin/python 
from BCBio import GFF 
from Bio import SeqIO 

def convert_to_GFF3(): 
    in_file = "/var/www/localhost/NC_009925.gbk" 
    out_file = "/var/www/localhost/output/your_file.gff" 
    in_handle = open(in_file) 
    out_handle = open(out_file, "w") 

    GFF.write(SeqIO.parse(in_handle, "genbank"), out_handle) 

    in_handle.close() 
    out_handle.close() 

convert_to_GFF3() 

Here is part of the result:

##gff-version 3 
##sequence-region NC_009925.1 1 6503724 
NC_009925.1 annotation remark 1 6503724 . . . accessions=NC_009925;comment=PROVISIONAL REFSEQ: This record has not yet been subject to final%0ANCBI review. The reference sequence was derived from CP000828.%0ASource bacteria from Marine Biotechnology Institute Culture%0ACollection%2C Marine Biotechnology Institute%2C 3-75-1 Heita%2C Kamaishi%2C%0AIwate 026-0001%2C Japan.%0ACOMPLETENESS: full length.;data_file_division=CON;date=10-JUN-2013;gi=158333233;keywords=;organism=Acaryochloris marina MBIC11017;references=location: %5B0:6503724%5D%0Aauthors: Swingley%2CW.D.%2C Chen%2CM.%2C Cheung%2CP.C.%2C Conrad%2CA.L.%2C Dejesa%2CL.C.%2C Hao%2CJ.%2C Honchak%2CB.M.%2C Karbach%2CL.E.%2C Kurdoglu%2CA.%2C Lahiri%2CS.%2C Mastrian%2CS.D.%2C Miyashita%2CH.%2C Page%2CL.%2C Ramakrishna%2CP.%2C Satoh%2CS.%2C Sattley%2CW.M.%2C Shimada%2CY.%2C Taylor%2CH.L.%2C Tomo%2CT.%2C Tsuchiya%2CT.%2C Wang%2CZ.T.%2C Raymond%2CJ.%2C Mimuro%2CM.%2C Blankenship%2CR.E. and Touchman%2CJ.W.%0Atitle: Niche adaptation and genome expansion in the chlorophyll d-producing cyanobacterium Acaryochloris marina%0Ajournal: Proc. Natl. Acad. Sci. U.S.A. 105 %286%29%2C 2005-2010 %282008%29%0Amedline id: %0Apubmed id: 18252824%0Acomment:,location: %5B0:6503724%5D%0Aauthors: %0Aconsrtm: NCBI Genome Project%0Atitle: Direct Submission%0Ajournal: Submitted %2817-OCT-2007%29 National Center for Biotechnology Information%2C NIH%2C Bethesda%2C MD 20894%2C USA%0Amedline id: %0Apubmed id: %0Acomment:,location: %5B0:6503724%5D%0Aauthors: Touchman%2CJ.W.%0Atitle: Direct Submission%0Ajournal: Submitted %2827-AUG-2007%29 Pharmaceutical Genomics Division%2C Translational Genomics Research Institute%2C 13208 E Shea Blvd%2C Scottsdale%2C AZ 85004%2C USA%0Amedline id: %0Apubmed id: %0Acomment:;sequence_version=1;source=Acaryochloris marina MBIC11017;taxonomy=Bacteria,Cyanobacteria,Oscillatoriophycideae,Chroococcales,Acaryochloris 
NC_009925.1 feature source 1 6503724 . + . db_xref=taxon:329726;mol_type=genomic DNA;note=type strain of Acaryochloris marina;organism=Acaryochloris marina MBIC11017;strain=MBIC11017 
NC_009925.1 feature gene 931 1581 . - . db_xref=GeneID:5685235;locus_tag=AM1_0001;note=conserved hypothetical protein;pseudo= 
NC_009925.1 feature gene 1627 2319 . - . db_xref=GeneID:5678840;locus_tag=AM1_0003 

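The problem is in the third and fourth lines: the writer takes the complete header information from the .gbk file and writes it out as a single line, when it should skip it. The last two lines are correct (as is the rest of the output file). I have tried several different .gbk files, and all produce the same result.

For reference, here is the beginning of the .gbk file: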

LOCUS  NC_009925   6503724 bp DNA  circular CON 10-JUN-2013 
DEFINITION Acaryochloris marina MBIC11017 chromosome, complete genome. 
ACCESSION NC_009925 
VERSION  NC_009925.1 GI:158333233 
DBLINK  Project: 58167 
      BioProject: PRJNA58167 
KEYWORDS . 
SOURCE  Acaryochloris marina MBIC11017 
    ORGANISM Acaryochloris marina MBIC11017 
      Bacteria; Cyanobacteria; Oscillatoriophycideae; Chroococcales; 
      Acaryochloris. 
REFERENCE 1 (bases 1 to 6503724) 
    AUTHORS Swingley,W.D., Chen,M., Cheung,P.C., Conrad,A.L., Dejesa,L.C., 
      Hao,J., Honchak,B.M., Karbach,L.E., Kurdoglu,A., Lahiri,S., 
      Mastrian,S.D., Miyashita,H., Page,L., Ramakrishna,P., Satoh,S., 
      Sattley,W.M., Shimada,Y., Taylor,H.L., Tomo,T., Tsuchiya,T., 
      Wang,Z.T., Raymond,J., Mimuro,M., Blankenship,R.E. and 
      Touchman,J.W. 
    TITLE  Niche adaptation and genome expansion in the chlorophyll 
      d-producing cyanobacterium Acaryochloris marina 
    JOURNAL Proc. Natl. Acad. Sci. U.S.A. 105 (6), 2005-2010 (2008) 
    PUBMED 18252824 
REFERENCE 2 (bases 1 to 6503724) 
    CONSRTM NCBI Genome Project 
    TITLE  Direct Submission 
    JOURNAL Submitted (17-OCT-2007) National Center for Biotechnology 
      Information, NIH, Bethesda, MD 20894, USA 
REFERENCE 3 (bases 1 to 6503724) 
    AUTHORS Touchman,J.W. 
    TITLE  Direct Submission 
    JOURNAL Submitted (27-AUG-2007) Pharmaceutical Genomics Division, 
      Translational Genomics Research Institute, 13208 E Shea Blvd, 
      Scottsdale, AZ 85004, USA 
COMMENT  PROVISIONAL REFSEQ: This record has not yet been subject to final 
      NCBI review. The reference sequence was derived from CP000828. 
      Source bacteria from Marine Biotechnology Institute Culture 
      Collection, Marine Biotechnology Institute, 3-75-1 Heita, Kamaishi, 
      Iwate 026-0001, Japan. 
      COMPLETENESS: full length. 
FEATURES    Location/Qualifiers 
    source   1..6503724 
        /organism="Acaryochloris marina MBIC11017" 
        /mol_type="genomic DNA" 
        /strain="MBIC11017" 
        /db_xref="taxon:329726" 
        /note="type strain of Acaryochloris marina" 
    gene   complement(931..1581) 
        /locus_tag="AM1_0001" 
        /note="conserved hypothetical protein" 
        /pseudo 
        /db_xref="GeneID:5685235" 
    gene   complement(1627..2319) 
        /locus_tag="AM1_0003" 
        /db_xref="GeneID:5678840" 
    CDS    complement(1627..2319) 
        /locus_tag="AM1_0003" 
        /codon_start=1 
        /transl_table=11 
        /product="NUDIX hydrolase" 
         /protein_id="YP_001514406.1" 
        /db_xref="GI:158333234" 
        /db_xref="GeneID:5678840" 
        /translation="MPYTYDYPRPGLTVDCVVFGLDEQIDLKVLLIQRQIPPFQHQWA 
       LPGGFVQMDESLEDAARRELREETGVQGIFLEQLYTFGDLGRDPRDRIISVAYYALIN 
       LIEYPLQASTDAEDAAWYSIENLPSLAFDHAQILKQAIRRLQGKVRYEPIGFELLPQK 
       FTLTQIQQLYETVLGHPLDKRNFRKKLLKMDLLIPLDEQQTGVAHRAARLYQFDQSKY 
       ELLKQQGFNFEV" 

Does anyone know how I can solve this?

I used the following line to filter out the two incorrect lines:

if "\tannotation\t" in line or "feature\tsource" in line: 

This seems to work on several test .gbk files. But I am still curious: why does it write those lines in the first place?
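
The check above can be wrapped in a small post-processing step. This is only a sketch, assuming the GFF3 output is tab-separated, with "annotation" or "feature" in column 2 and the feature type in column 3, as in the output shown earlier. The name `drop_header_lines` is made up for illustration and is not part of BCBio.

```python
def drop_header_lines(lines):
    """Yield GFF3 lines, skipping the annotation remark and the source feature.

    Assumption: columns are tab-separated, column 2 holds "annotation" or
    "feature", and column 3 holds the feature type.
    """
    for line in lines:
        if line.startswith("#"):
            # Keep directives such as ##gff-version and ##sequence-region.
            yield line
            continue
        cols = line.split("\t")
        if len(cols) >= 3:
            if cols[1] == "annotation":
                continue  # whole-record header annotations
            if cols[1] == "feature" and cols[2] == "source":
                continue  # the GenBank "source" feature
        yield line
```

With `src` and `dst` as open file handles for the written .gff file and a filtered copy, `dst.writelines(drop_header_lines(src))` would apply it. Comparing whole columns rather than substrings avoids dropping a line just because the text happens to appear in its attribute column.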

Answer

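The answer is on the wiki page you linked (http://biopython.org/wiki/GFF_Parsing#Writing_GFF3): "The GFF3Writer takes an iterator of SeqRecord objects, and writes each SeqFeature as a GFF3 line". The SeqRecord objects parsed from the .gbk file contain these annotations, so the writer writes them out. In the implementation (https://github.com/chapmanb/bcbb/blob/master/gff/BCBio/GFF/GFFOutput.py) you can see where this is done: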

self._write_annotations(rec.annotations, rec.id, len(rec.seq), out_handle) 

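There you can also see why the source feature is passed through. It is simply a feature like the others (gene, CDS) and is not handled separately.

I don't know why there is no option or parameter (at least I did not find one) to tell the writer to skip the annotations. I am also not aware of any parameter for SeqIO.parse() that skips the annotations when the SeqRecords are created.

To solve your problem, I access the parsed SeqRecords individually and remove the annotations and the source feature. One downside of this approach is the extra memory it needs (plus a performance penalty), because I build a list from the initial generator. Finally, I simply write the list out as GFF. I am not sure this approach is much better than yours.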

#!/usr/bin/env python 
from BCBio import GFF 
from Bio import SeqIO 

def convert_to_GFF3(): 
    in_file = "input.gbk" 
    out_file = "output.gff" 
    in_handle = open(in_file) 
    out_handle = open(out_file, "w") 

    records = [] 
    for record in SeqIO.parse(in_handle, "genbank"):
        # delete annotations
        record.annotations = {}
        # loop through features to find the source
        for i, feature in enumerate(record.features):
            # if found, delete it and stop (only expect one source)
            if feature.type == "source":
                record.features.pop(i)
                break
        records.append(record)

    GFF.write(records, out_handle) 

    in_handle.close() 
    out_handle.close() 

convert_to_GFF3()
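
If the extra memory for the list is a concern, the same clean-up can probably be done lazily. This is a minimal sketch, assuming `GFF.write` consumes its input one record at a time, as the wiki's description of it taking an iterator of SeqRecord objects suggests. `strip_header_info` and `convert_to_gff3` are illustrative names, not part of BCBio or Biopython. Unlike the loop above, this removes every feature of type "source", not just the first.

```python
def strip_header_info(records):
    """Lazily yield records with annotations and "source" features removed."""
    for record in records:
        # Drop the header annotations (comment, references, taxonomy, ...).
        record.annotations = {}
        # Keep every feature except those of type "source".
        record.features = [f for f in record.features if f.type != "source"]
        yield record


def convert_to_gff3(in_file, out_file):
    # Imported here so the helper above can be tried without the
    # libraries installed.
    from BCBio import GFF
    from Bio import SeqIO

    with open(in_file) as in_handle, open(out_file, "w") as out_handle:
        cleaned = strip_header_info(SeqIO.parse(in_handle, "genbank"))
        GFF.write(cleaned, out_handle)
```

Calling `convert_to_gff3("input.gbk", "output.gff")` would then write the file without first collecting all the records in a list.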