Python: Merging two CSV files into multi-level JSON

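I'm very new to Python/JSON, so please bear with me. I can do this in R, but we need to use Python so that it can be ported to Python/Spark/MongoDB. Also, I'm only posting a minimal subset: I have more file types, so if someone can help me with this, I can build on it to integrate more files and file types.

Back to my question:

I have two tsv input files that I need to merge and convert to JSON. Both files have gene and sample columns plus some additional columns. However, gene and sample may or may not overlap, as shown below: f2.tsv has all the genes in f1.tsv but also an extra gene, g3. Similarly, the sample column has both overlapping and non-overlapping values across the two files.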

# f1.tsv – has gene, sample and additional column other1 

$ cat f1.tsv 
gene sample other1 
g1  s1  a1 
g1  s2  b1 
g1  s3a  c1 
g2  s4  d1 

# f2.tsv – has gene, sample and additional columns other21, other22 

$ cat f2.tsv 
gene sample other21 other22 
g1  s1  a21  a22 
g1  s2  b21  b22 
g1  s3b  c21  c22 
g2  s4  d21  d22 
g3  s5  f21  f22

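The gene forms the top level. Each gene has multiple samples, which form the second level, and the other columns form the extras, which are the third level. The extras are split into two parts because one file has other1 and the second file has other21 and other22. Other files that I will include later will have other fields such as other31 and other32, but they will still have the gene and sample columns.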

# expected output – JSON by combining both tsv files. 
$ cat output.json 
[{ 
    "gene":"g1", 
    "samples":[ 
    { 
     "sample":"s2", 
     "extras":[ 
     { 
      "other1":"b1" 
     }, 
     { 
      "other21":"b21", 
      "other22":"b22" 
     } 
     ] 
    }, 
    { 
     "sample":"s1", 
     "extras":[ 
     { 
      "other1":"a1" 
     }, 
     { 
      "other21":"a21", 
      "other22":"a22" 
     } 
     ] 
    }, 
    { 
     "sample":"s3b", 
     "extras":[ 
     { 
      "other21":"c21", 
      "other22":"c22" 
     } 
     ] 
    }, 
    { 
     "sample":"s3a", 
     "extras":[ 
     { 
      "other1":"c1" 
     } 
     ] 
    } 
    ] 
},{ 
    "gene":"g2", 
    "samples":[ 
    { 
     "sample":"s4", 
     "extras":[ 
     { 
      "other1":"d1" 
     }, 
     { 
      "other21":"d21", 
      "other22":"d22" 
     } 
     ] 
    } 
    ] 
},{ 
    "gene":"g3", 
    "samples":[ 
    { 
     "sample":"s5", 
     "extras":[ 
     { 
      "other21":"f21", 
      "other22":"f22" 
     } 
     ] 
    } 
    ] 
}]

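How can I convert the two csv files into a single multi-level JSON based on the two common columns?

I would really appreciate any help I can get with this.

Thanks!

Source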

2016-08-19 Komal Rathi

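Here is another way to do it. I tried to keep it manageable as you start adding more files. You can run it from the command line and pass one argument for each file you want to add. Gene/sample names are stored in dictionaries for efficiency. The formatting of the JSON object you want is done in each class's format() method. Hope this helps.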

import csv, json, sys

class Sample(object):
    def __init__(self, name, extras):
        self.name = name
        # one dict of extra columns per input file
        self.extras = [extras]

    def format(self):
        result = {}
        result['sample'] = self.name
        result['extras'] = self.extras
        return result

    def add_extras(self, extras):
        # edit 8/20
        # always just add the new extras to the list
        for extra in extras:
            self.extras.append(extra)

class Gene(object):
    def __init__(self, name, samples):
        self.name = name
        # dict mapping sample name -> Sample
        self.samples = samples

    def format(self):
        result = {}
        result['gene'] = self.name
        result['samples'] = sorted([self.samples[sample_key].format() for sample_key in self.samples], key=lambda sample: sample['sample'])
        return result

    def create_or_add_samples(self, new_samples):
        # loop through new samples, seeing if they already exist in the gene object
        for sample_name in new_samples:
            sample = new_samples[sample_name]
            if sample.name in self.samples:
                self.samples[sample.name].add_extras(sample.extras)
            else:
                self.samples[sample.name] = sample

class Genes(object):
    def __init__(self):
        # dict mapping gene name -> Gene
        self.genes = {}

    def format(self):
        return sorted([self.genes[gene_name].format() for gene_name in self.genes], key=lambda gene: gene['gene'])

    def create_or_add_gene(self, gene):
        if gene.name not in self.genes:
            self.genes[gene.name] = gene
        else:
            self.genes[gene.name].create_or_add_samples(gene.samples)

def row_to_gene(headers, row):
    # split one row into gene name, sample name and the remaining "extra" columns
    gene_name = ""
    sample_name = ""
    extras = {}
    for index, value in enumerate(row):
        if headers[index] == "gene":
            gene_name = value
        elif headers[index] == "sample":
            sample_name = value
        else:
            extras[headers[index]] = value
    sample_dict = {}
    sample_dict[sample_name] = Sample(sample_name, extras)
    return Gene(gene_name, sample_dict)

if __name__ == '__main__':
    delim = "\t"
    genes = Genes()
    files = sys.argv[1:]

    for path in files:
        print("Reading " + str(path))
        with open(path, 'r', newline='') as f1:
            reader = csv.reader(f1, delimiter=delim)
            headers = []
            for row in reader:
                # the first row of each file is its header
                if len(headers) == 0:
                    headers = row
                else:
                    genes.create_or_add_gene(row_to_gene(headers, row))

    result = json.dumps(genes.format(), indent=4)
    print(result)
    with open('json_output.txt', 'w') as output:
        output.write(result)

Source

2016-08-19 17:53:44 gregbert

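It works great. I really like that you made it so general: I can specify the delimiter as well as any number of files. This is incredible!

I just have one question: for g1/s1 it shows "extras": [ { "other1": "a1" }, [ { "other22": "a22", "other21": "a21" } ] ] and I would like to remove the extra inner square brackets.

@KomalRathi Oops, sorry. I edited to fix it. – gregbert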
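The earlier version of the answer's code is not shown here, so the exact cause is an assumption, but nested brackets like the ones in the comment above typically appear when a whole list is appended as a single element instead of having its items added one by one. A minimal sketch with hypothetical variable names (none of them come from the answer):

```python
# Hypothetical illustration; these names are not taken from the answer's code.
first = {"other1": "a1"}                          # extras read from the first file
second = [{"other21": "a21", "other22": "a22"}]   # extras from the second file, already a list

nested = [first]
nested.append(second)   # the whole list becomes one element, giving inner brackets

flat = [first]
flat.extend(second)     # each dict is added individually, giving a flat list

print(nested)
print(flat)
```

Looping over the incoming list and appending each item, as add_extras does in the edited answer, has the same effect as extend.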

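This looks like a job for pandas! Unfortunately, pandas only takes us so far, and then we have to do some of the manipulation ourselves. This is neither fast nor particularly efficient code, but it gets the job done.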

import pandas as pd
import json
from collections import defaultdict

# here we import the tsv files as pandas df
# (sep=r'\s+' splits on any whitespace; delim_whitespace=True is deprecated in recent pandas)
f1 = pd.read_csv('f1.tsv', sep=r'\s+')
f2 = pd.read_csv('f2.tsv', sep=r'\s+')

# we then let pandas merge them
newframe = f1.merge(f2, how='outer', on=['gene', 'sample'])

# have pandas write them out to a json, and then read them back in as a
# python object (a list of dicts)
pythonList = json.loads(newframe.to_json(orient='records'))


newDict = {}
for d in pythonList:
    gene = d['gene']
    sample = d['sample']
    sampleDict = {'sample': sample,
                  'extras': []}

    # groups the "other" columns by the file they came from
    extrasdict = defaultdict(dict)

    if gene not in newDict:
        newDict[gene] = {'gene': gene, 'samples': []}

    for key, value in d.items():
        # skip gene/sample and the nulls left by the outer merge
        if 'other' not in key or value is None:
            continue
        else:
            suffix = key.split('other')[-1]
            if len(suffix) == 1:
                extrasdict['1'][key] = value
            else:
                extrasdict['{}'.format(suffix[0])][key] = value

    for value in extrasdict.values():
        sampleDict['extras'].append(value)

    newDict[gene]['samples'].append(sampleDict)

newList = [v for k, v in newDict.items()]

print(json.dumps(newList))

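If this looks like a solution that works for you, I'm happy to spend some time cleaning it up to make it a bit more readable and efficient.

PS: If you like R, then pandas is the way to go (it was written to give Python an R-like interface to data).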
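To see what the outer merge in this answer hands to the post-processing loop, here is a small sketch that runs the same merge on the question's sample data. The in-memory strings and the indicator=True argument are additions for illustration only; the answer itself reads f1.tsv and f2.tsv from disk and does not use the indicator column.

```python
import io

import pandas as pd

# The question's sample data, inlined so the sketch is self-contained.
f1_text = "gene sample other1\ng1 s1 a1\ng1 s2 b1\ng1 s3a c1\ng2 s4 d1\n"
f2_text = (
    "gene sample other21 other22\n"
    "g1 s1 a21 a22\ng1 s2 b21 b22\ng1 s3b c21 c22\n"
    "g2 s4 d21 d22\ng3 s5 f21 f22\n"
)

f1 = pd.read_csv(io.StringIO(f1_text), sep=r"\s+")
f2 = pd.read_csv(io.StringIO(f2_text), sep=r"\s+")

# indicator=True adds a _merge column saying which file(s) each row came from.
merged = f1.merge(f2, how="outer", on=["gene", "sample"], indicator=True)
print(merged)
```

Rows that exist in only one file keep NaN in the other file's columns. After the to_json/json.loads round trip those become None, which is why the loop skips values that are None.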

Source

2016-08-19 17:18:22

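This solution is perfect! I only accepted gregbert's answer because his code lets me specify any number of input files as well as the delimiter. Thank you very much.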

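Do it in steps:

1. Read the incoming tsv files and aggregate the information for the different genes into a dictionary.
2. Process that dictionary to match your desired format.
3. Write the result to a JSON file.

Here is the code: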

import csv
import json
from collections import defaultdict

input_files = ['f1.tsv', 'f2.tsv']
output_file = 'genes.json'

# Step 1
gene_dict = defaultdict(lambda: defaultdict(list))
for file in input_files:
    with open(file, 'r') as f:
        reader = csv.DictReader(f, delimiter='\t')
        for line in reader:
            gene = line.pop('gene')
            sample = line.pop('sample')
            gene_dict[gene][sample].append(line)

# Step 2
out = [{'gene': gene,
        'samples': [{'sample': sample, 'extras': extras}
                    for sample, extras in samples.items()]}
       for gene, samples in gene_dict.items()]

# Step 3
with open(output_file, 'w') as f:
    json.dump(out, f)

Source

2016-08-19 18:01:09

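This solution is perfect! I only accepted gregbert's answer because his code lets me specify any number of input files as well as the delimiter. Thank you very much.

Note that both of these are also easy to do with my code: to add more input files, append their names to the 'input_files' list; to change the delimiter, edit line 12 (the csv.DictReader call).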
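As a rough sketch of that idea, the three steps can be wrapped in a function that takes the file list and the delimiter as parameters. The function name merge_tsv and its signature are hypothetical, and the sorting of genes and samples is an addition that only makes the output order predictable; neither comes from the answer.

```python
import csv
import json
from collections import defaultdict


def merge_tsv(paths, delimiter="\t"):
    """Merge delimited files that all have 'gene' and 'sample' columns.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # gene -> sample -> list of dicts, one dict of extra columns per file
    gene_dict = defaultdict(lambda: defaultdict(list))
    for path in paths:
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle, delimiter=delimiter):
                gene = row.pop("gene")
                sample = row.pop("sample")
                gene_dict[gene][sample].append(row)

    # Reshape into the nested list-of-dicts layout asked for in the question.
    return [
        {
            "gene": gene,
            "samples": [
                {"sample": sample, "extras": extras}
                for sample, extras in sorted(samples.items())
            ],
        }
        for gene, samples in sorted(gene_dict.items())
    ]


def write_json(result, output_path):
    """Write the merged structure to a JSON file."""
    with open(output_path, "w") as handle:
        json.dump(result, handle, indent=4)
```

With this in place, a call such as write_json(merge_tsv(['f1.tsv', 'f2.tsv']), 'genes.json') would correspond to the script above, and a comma-separated input would only need delimiter=','.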
