2015-10-07

Parsing a Gene Ontology (.obo) file into a parent/child JSON file

I am working on a D3 visualisation and need some help/advice on building a "tree" file in JSON format:

{
    "name": "flare",
    "description": "flare",
    "children": [
        {
            "name": "analytic",
            "description": "analytics",
            "children": [
                {
                    "name": "cluster",
                    "description": "cluster",
                    "children": [
                        {"name": "Agglomer", "description": "AgglomerativeCluster", "size": 3938},
                        {"name": "Communit", "description": "CommunityStructure", "size": 3812},
                        {"name": "Hierarch", "description": "HierarchicalCluster", "size": 6714},
                        {"name": "MergeEdg", "description": "MergeEdge", "size": 743}
                    ]
                }, etc..

This format seems easy to replicate with dictionaries in Python, each entry having three fields: name, description and children[].
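A minimal sketch of that idea, assuming plain nested dicts are enough. The helper name `make_node` is hypothetical, and the sample values are copied from the flare excerpt above. Note that the leaves in the excerpt carry a `size` field instead of `children`; this sketch simply gives every node an empty `children` list.

```python
import json

def make_node(name, description, children=None):
    """Return one tree entry with the three fields used in the excerpt above."""
    return {"name": name, "description": description, "children": children or []}

# Rebuild a small slice of the flare example from the inside out
leaf = make_node("Agglomer", "AgglomerativeCluster")
cluster = make_node("cluster", "cluster", [leaf])
analytic = make_node("analytic", "analytics", [cluster])
root = make_node("flare", "flare", [analytic])

# Nested dicts and lists serialise directly to the nested JSON structure
tree_json = json.dumps(root, indent=4)
print(tree_json)
```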

My problem is actually how to extract the data. The file linked above has "objects" structured as:

[Term] 
id: GO:0000001 
name: mitochondrion inheritance 
namespace: biological_process 
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764] 
synonym: "mitochondrial inheritance" EXACT [] 
is_a: GO:0048308 ! organelle inheritance 
is_a: GO:0048311 ! mitochondrion distribution 

From each of these I need the id, is_a and name fields. I have tried to parse this with Python, but I can't seem to find a way to locate each object.

Any ideas?

Answer


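Here's a fairly simple way to parse the objects in a '.obo' file. It saves the object data into a dict with the id as the key, and the name and is_a data stored in a list. It then pretty-prints the dict using the standard json module's .dumps function.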

For testing purposes I used a truncated version of the file in your link, which only goes up to id: GO:0000006.

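This code ignores any objects that contain an is_obsolete field. It also strips the description info from the is_a fields; I figured you'd probably want that, but it's easy enough to disable.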
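As a small aside on the is_a handling just described, here is how `str.partition` behaves on one of the is_a values from the question. The variable names are only illustrative; the full script follows.

```python
# One is_a value as it appears in the .obo file
raw = "GO:0048308 ! organelle inheritance"

# partition splits on the first ' ! ' and returns (before, separator, after)
term_id, sep, label = raw.partition(" ! ")
print(term_id)
print(label)

# When the separator is absent, the whole string comes back as the first item,
# so an is_a value without a description passes through unchanged
bare_id, bare_sep, bare_label = "GO:0048308".partition(" ! ")
print((bare_id, bare_sep, bare_label))
```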

#!/usr/bin/env python 

''' Parse object data from a .obo file 

    From http://stackoverflow.com/q/32989776/4014959 

    Written by PM 2Ring 2015.10.07 
''' 

from __future__ import print_function, division 

import json 
from collections import defaultdict 

fname = "go-basic.obo" 
term_head = "[Term]" 

#Keep the desired object data here 
all_objects = {} 

def add_object(d):
    #print(json.dumps(d, indent = 4) + '\n')
    #Ignore obsolete objects
    if "is_obsolete" in d:
        return

    #Gather desired data into a single list,
    # and store it in the main all_objects dict
    key = d["id"][0]
    is_a = d["is_a"]
    #Remove the next line if you want to keep the is_a description info
    is_a = [s.partition(' ! ')[0] for s in is_a]
    all_objects[key] = d["name"] + is_a


#A temporary dict to hold object data 
current = defaultdict(list) 

with open(fname) as f:
    #Skip header data
    for line in f:
        if line.rstrip() == term_head:
            break

    for line in f:
        line = line.rstrip()
        if not line:
            #ignore blank lines
            continue
        if line == term_head:
            #end of term
            add_object(current)
            current = defaultdict(list)
        else:
            #accumulate object data into current
            key, _, val = line.partition(": ")
            current[key].append(val)

if current:
    add_object(current)

print("\nall_objects =") 
print(json.dumps(all_objects, indent = 4, sort_keys=True)) 

Output

all_objects =
{
    "GO:0000001": [
        "mitochondrion inheritance",
        "GO:0048308",
        "GO:0048311"
    ],
    "GO:0000002": [
        "mitochondrial genome maintenance",
        "GO:0007005"
    ],
    "GO:0000003": [
        "reproduction",
        "GO:0008150"
    ],
    "GO:0000006": [
        "high-affinity zinc uptake transmembrane transporter activity",
        "GO:0005385"
    ]
}
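The script above stops at a flat id-to-list mapping. One possible way to get from there to the nested parent/children structure from the question is sketched below. Everything in it is an assumption rather than part of the original answer: the function names are hypothetical, the `X:` ids are made-up sample data in the same shape as all_objects, the term id is placed in "name" and the term name in "description", and a term is treated as a root when none of its parents are present in the dict. It is written for Python 3.

```python
import json
from collections import defaultdict

# Made-up sample in the same shape as all_objects: id -> [name, parent ids...]
sample = {
    "X:1": ["root term"],
    "X:2": ["child a", "X:1"],
    "X:3": ["child b", "X:1"],
    "X:4": ["grandchild", "X:2", "X:3"],
}

def build_children_index(objects):
    """Invert the child -> parents links into parent -> [child ids]."""
    children = defaultdict(list)
    for term_id in sorted(objects):
        for parent in objects[term_id][1:]:
            children[parent].append(term_id)
    return children

def find_roots(objects):
    """Return ids of terms that have no parent present in objects."""
    return sorted(
        term_id for term_id, fields in objects.items()
        if not any(parent in objects for parent in fields[1:])
    )

def build_tree(term_id, objects, children):
    """Recursively build a {"name", "description", "children"} node."""
    return {
        "name": term_id,
        "description": objects[term_id][0],
        "children": [build_tree(child, objects, children)
                     for child in children.get(term_id, [])],
    }

children = build_children_index(sample)
forest = [build_tree(root, sample, children) for root in find_roots(sample)]
print(json.dumps(forest, indent=4))
```

Because a term can have more than one is_a parent, the data is a graph rather than a strict tree, so this sketch repeats a term (here `X:4`) under each of its parents. On a full ontology that duplication can make the output large, and the recursion relies on the is_a links containing no cycles.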