2017-10-05 34 views
0

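How to merge multiple JSON files in Python

While processing a corpus (using GNRD, http://gnrd.globalnames.org/, for scientific name extraction), I had to create multiple JSON files. I now want to use these JSON files to annotate that corpus as a whole.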

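I am trying to merge multiple JSON files in Python. Each file contains a "names" array whose entries are objects with a scientificName key and the extracted name as its value. Here is an example of one of the shorter files: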

{ 
    "file":"biodiversity_trophic_9.txt", 
    "names":[ 
    { 
     "scientificName":"Bufo" 
    }, 
    { 
     "scientificName":"Eleutherodactylus talamancae" 
    }, 
    { 
     "scientificName":"E. punctariolus" 
    }, 
    { 
     "scientificName":"Norops lionotus" 
    }, 
    { 
     "scientificName":"Centrolenella prosoblepon" 
    }, 
    { 
     "scientificName":"Sibon annulatus" 
    }, 
    { 
     "scientificName":"Colostethus flotator" 
    }, 
    { 
     "scientificName":"C. inguinalis" 
    }, 
    { 
     "scientificName":"Eleutherodactylus" 
    }, 
    { 
     "scientificName":"Hyla columba" 
    }, 
    { 
     "scientificName":"Bufo haematiticus" 
    }, 
    { 
     "scientificName":"S. annulatus" 
    }, 
    { 
     "scientificName":"Leptodeira septentrionalis" 
    }, 
    { 
     "scientificName":"Imantodes cenchoa" 
    }, 
    { 
     "scientificName":"Oxybelis brevirostris" 
    }, 
    { 
     "scientificName":"Cressa" 
    }, 
    { 
     "scientificName":"Coloma" 
    }, 
    { 
     "scientificName":"Perlidae" 
    }, 
    { 
     "scientificName":"Hydropsychidae" 
    }, 
    { 
     "scientificName":"Hyla" 
    }, 
    { 
     "scientificName":"Norops" 
    }, 
    { 
     "scientificName":"Hyla colymbiphyllum" 
    }, 
    { 
     "scientificName":"Colostethus inguinalis" 
    }, 
    { 
     "scientificName":"Oxybelis" 
    }, 
    { 
     "scientificName":"Rana warszewitschii" 
    }, 
    { 
     "scientificName":"R. warszewitschii" 
    }, 
    { 
     "scientificName":"Rhyacophilidae" 
    }, 
    { 
     "scientificName":"Daphnia magna" 
    }, 
    { 
     "scientificName":"Hyla colymba" 
    }, 
    { 
     "scientificName":"Centrolenella" 
    }, 
    { 
     "scientificName":"Orconectes nais" 
    }, 
    { 
     "scientificName":"Orconectes neglectus" 
    }, 
    { 
     "scientificName":"Campostoma anomalum" 
    }, 
    { 
     "scientificName":"Caridina" 
    }, 
    { 
     "scientificName":"Decapoda" 
    }, 
    { 
     "scientificName":"Atyidae" 
    }, 
    { 
     "scientificName":"Cerastoderma edule" 
    }, 
    { 
     "scientificName":"Rana aurora" 
    }, 
    { 
     "scientificName":"Riffle" 
    }, 
    { 
     "scientificName":"Calopterygidae" 
    }, 
    { 
     "scientificName":"Elmidae" 
    }, 
    { 
     "scientificName":"Gyrinidae" 
    }, 
    { 
     "scientificName":"Gerridae" 
    }, 
    { 
     "scientificName":"Naucoridae" 
    }, 
    { 
     "scientificName":"Oligochaeta" 
    }, 
    { 
     "scientificName":"Veliidae" 
    }, 
    { 
     "scientificName":"Libellulidae" 
    }, 
    { 
     "scientificName":"Philopotamidae" 
    }, 
    { 
     "scientificName":"Ephemeroptera" 
    }, 
    { 
     "scientificName":"Psephenidae" 
    }, 
    { 
     "scientificName":"Baetidae" 
    }, 
    { 
     "scientificName":"Corduliidae" 
    }, 
    { 
     "scientificName":"Zygoptera" 
    }, 
    { 
     "scientificName":"B. buto" 
    }, 
    { 
     "scientificName":"C. euknemos" 
    }, 
    { 
     "scientificName":"C. ilex" 
    }, 
    { 
     "scientificName":"E. padi noblei" 
    }, 
    { 
     "scientificName":"E. padi" 
    }, 
    { 
     "scientificName":"E. bufo" 
    }, 
    { 
     "scientificName":"E. butoni" 
    }, 
    { 
     "scientificName":"E. crassi" 
    }, 
    { 
     "scientificName":"E. cruentus" 
    }, 
    { 
     "scientificName":"H. colymbiphyllum" 
    }, 
    { 
     "scientificName":"N. aterina" 
    }, 
    { 
     "scientificName":"S. ilex" 
    }, 
    { 
     "scientificName":"Anisoptera" 
    }, 
    { 
     "scientificName":"Riffle delta" 
    } 
    ], 
    "total":67, 
    "status":200, 
    "unique":true, 
    "engines":[ 
    "TaxonFinder", 
    "NetiNeti" 
    ], 
    "verbatim":false, 
    "input_url":null, 
    "token_url":"http://gnrd.globalnames.org/name_finder.html?token=2rtc4e70st", 
    "parameters":{ 
    "engine":0, 
    "return_content":false, 
    "best_match_only":false, 
    "data_source_ids":[ 

    ], 
    "detect_language":true, 
    "all_data_sources":false, 
    "preferred_data_sources":[ 

    ] 
    }, 
    "execution_time":{ 
    "total_duration":3.1727607250213623, 
    "find_names_duration":1.9656541347503662, 
    "text_preparation_duration":1.000107765197754 
    }, 
    "english_detected":true 
} 

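My problem is that there may be duplicates across the files, which I would like to remove (otherwise I guess I could simply concatenate the files). The questions I have found so far are about merging extra keys and values to extend the array itself.

Can anyone give me some guidance on how to solve this?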

+3

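Load the JSON files into Python as whatever object type represents them. Then merge those objects with whatever logic you need (there is no generic 'please merge these' rule; you need to decide what a sensible merge means and what the resulting object should look like). Then serialize the merged object back to JSON. – Bilkokuya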

+1

Can you give an example of the expected result? – Don

+0

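Thanks for your comments. The expected result is all the files merged into one, preferably with any duplicates removed, since I think duplicates may cause problems in the subsequent annotation I want to perform on the corpus. All the files look like the one shown above, but there are 15 of them, and each one begins and ends with metadata such as the number of entries, search time, etc. Would it be best to remove that from each file manually first? –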

Answers

0

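If I understand correctly, you want to collect all the "scientificName" values from the "names" element across a batch of files. If I'm wrong, please give an expected output to make things easier to understand.

I would do something like this: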

import json

all_names = set()  # use a set to avoid duplicates

# put all your files in this tuple; only two placeholder names are shown
filenames = ('file1.json', 'file2.json')

for filename in filenames:
    try:
        with open(filename, 'rt') as finput:
            data = json.load(finput)
        # read only the "names" list; the other keys in the file are ignored
        for name in data.get('names', []):
            all_names.add(name.get('scientificName'))
    except Exception as exc:
        print("Skipped file {} because exception {}".format(filename, str(exc)))

print(all_names)
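
With 15 files, typing every name into the tuple gets tedious. As a rough sketch, the same loop could be wrapped in a function that finds the files with a glob pattern. The function name collect_scientific_names and the pattern passed to it are made up for illustration; they are not part of GNRD or of the code above. Because only the "names" list is read, the metadata at the start and end of each file should not need to be removed by hand.

```python
import glob
import json


def collect_scientific_names(pattern):
    """Return the set of unique scientificName values in all files matching pattern.

    Hypothetical helper: the name and the glob-based file discovery are
    assumptions for illustration, not something the original script defines.
    """
    unique_names = set()
    for path in sorted(glob.glob(pattern)):
        with open(path, "rt", encoding="utf-8") as handle:
            data = json.load(handle)
        # Only "names" is read; keys such as "total" or "execution_time"
        # are ignored, so they can stay in the input files.
        for entry in data.get("names", []):
            name = entry.get("scientificName")
            if name is not None:
                unique_names.add(name)
    return unique_names
```

Assuming the GNRD results sit together in one directory, calling collect_scientific_names with a pattern such as "gnrd_output/*.json" (a hypothetical path) would return the de-duplicated set. Unlike the loop above, this version does not catch exceptions, so an unreadable or malformed file stops the run instead of being skipped.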

And if you want output in a format similar to the initial files, add this to the script that builds all_names:

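from pprint import pprint  # import the function, not just the module

# "names" is a list of {"scientificName": ...} objects, as in the input files
pprint({"names": [{"scientificName": name} for name in all_names], "total": len(all_names)})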
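
pprint prints a Python representation (single quotes, Python literals), which is not valid JSON. If the merged result has to be saved as a real JSON file for the later annotation step, something along these lines might work. The helper names and the output file name are assumptions chosen for this sketch, and the names are sorted only to make the output reproducible.

```python
import json


def build_merged_document(names):
    """Build a dict with the same "names"/"total" shape as a GNRD result.

    Hypothetical helper: only the two keys shown here are reproduced;
    the remaining GNRD metadata is deliberately left out.
    """
    unique_sorted = sorted(set(names))
    return {
        "names": [{"scientificName": name} for name in unique_sorted],
        "total": len(unique_sorted),
    }


def write_merged_document(names, output_path):
    """Serialize the merged names to output_path as indented JSON."""
    with open(output_path, "wt", encoding="utf-8") as handle:
        json.dump(build_merged_document(names), handle, indent=4, ensure_ascii=False)
```

For example, write_merged_document(all_names, "merged_names.json") would write the de-duplicated names to a file called merged_names.json (a placeholder name) in the current directory.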