Optimizing the parsing of a file of JSON objects into a pandas DataFrame

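I have a piece of code like the one shown below. It takes about 5 seconds to run, which is quite long for a 1000-line file, so I am looking for a way to optimize it, but I don't know how to improve the current version.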

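I have a large file that contains one valid JSON object per line. Each JSON object looks like this (the real data is much larger and more deeply nested, so this fragment is only an illustration):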

{"location":{"town":"Rome","groupe":"Advanced", 
    "school":{"SchoolGroupe":"TrowMet", "SchoolName":"VeronM"}}, 
    "id":"145", 
    "Mother":{"MotherName":"Helen","MotherAge":"46"},"NGlobalNote":2, 
    "Father":{"FatherName":"Peter","FatherAge":"51"}, 
    "Teacher":["MrCrock","MrDaniel"],"Field":"Marketing", 
    "season":["summer","spring"]}

I need to parse this file and extract only some of the key values from each JSON object. The result should be a DataFrame:

Groupe    Id  MotherName  FatherName
Advanced  56  Laure       James
Middle    11  Ann         Nicolas
Advanced  6   Helen       Franc

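However, some of the keys I need in the DataFrame are missing from some of the JSON objects, so I have to check whether each key exists and otherwise fill the corresponding value with a null. This is my current approach: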

import json
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['groupe', 'id', 'MotherName', 'FatherName'])
with open('path/to/file') as f:
    for chunk in f:
        jfile = json.loads(chunk)
        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan
        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan
        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan
        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan
        # Appending one row at a time copies the whole DataFrame on every iteration.
        df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName, "FatherName": FatherName},
                       ignore_index=True)

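I need to bring the execution time for the whole 1000-line file down to 2 seconds at most. In Perl the same parsing function takes less than 1 second, but I need to implement it in Python.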


2016-02-26 Amanda

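You will get the best performance if you can build the DataFrame in a single step at initialization. DataFrame.from_records takes a sequence of tuples, which you can supply from a generator that reads one record at a time. You can also parse the data faster with get, which returns a default value when the item is not found. I created an empty dict called dummy and pass it as the default for the intermediate get calls, so the chained get always has a dict to work on.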

I created a dataset of 1000 records, and on my crappy laptop the time went from 18 seconds to 0.06 seconds. That's pretty good.

import json
import time

import numpy as np
import pandas as pd

def extract_data(data):
    """Convert one line of JSON into a tuple of the wanted fields."""
    dummy = {}  # default for intermediate lookups so the chained get() works
    jfile = json.loads(data.strip())
    return (
        jfile.get('location', dummy).get('groupe', np.nan),
        jfile.get('id', np.nan),
        jfile.get('Mother', dummy).get('MotherName', np.nan),
        jfile.get('Father', dummy).get('FatherName', np.nan))

start = time.time()
with open('file.json') as f:
    # Column names follow the order of the tuple returned by extract_data.
    df = pd.DataFrame.from_records(
        map(extract_data, f),
        columns=['groupe', 'id', 'MotherName', 'FatherName'])
print('New algorithm', time.time() - start)

#
# The original way
# (DataFrame.append was removed in pandas 2.0, so this part needs an older pandas)
#

start = time.time()
df = pd.DataFrame(columns=['groupe', 'id', 'MotherName', 'FatherName'])
with open('file.json') as f:
    for chunk in f:
        jfile = json.loads(chunk)
        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan
        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan
        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan
        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan
        df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName, "FatherName": FatherName},
                       ignore_index=True)
print('original', time.time() - start)


2016-02-26 07:21:05 tdelaney

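I get "AttributeError: 'list' object has no attribute 'get'" with this approach! Don't forget that my file has one JSON object per line; maybe that is the problem. So I need to iterate over the lines to parse each JSON object – Amanda

That is, the file as a whole is not JSON, but each line of the file is valid JSON – Amanda

It works, except where a value is not a dictionary but nested JSON! How can I use the .get method in that case? @tdelaney – Amanda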
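
One possible way to handle the case raised in the comment above, where an intermediate value is a list (or is missing) rather than a dict, is a small helper that walks the keys and falls back to a default as soon as it meets something that is not a dict. This is only a sketch: nested_get is a hypothetical helper name, not part of pandas or of the answer's code, and it assumes that a non-dict intermediate value should simply be treated as missing.

```python
import json

import numpy as np


def nested_get(obj, keys, default=np.nan):
    """Follow `keys` through nested dicts; return `default` on any miss.

    Hypothetical helper (not from the original answer). A non-dict value
    met along the way, such as a list, is treated as a missing key.
    """
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


def extract_data(line):
    """Convert one line of JSON into a tuple of the wanted fields."""
    jfile = json.loads(line)
    return (
        nested_get(jfile, ('location', 'groupe')),
        nested_get(jfile, ('id',)),
        nested_get(jfile, ('Mother', 'MotherName')),
        nested_get(jfile, ('Father', 'FatherName')),
    )
```

The tuples returned by this version of extract_data can be fed to pd.DataFrame.from_records in the same way as in the answer.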

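The key point is not to append each row to the DataFrame inside the loop. Collect the values in a list or dictionary container and build the DataFrame from them in one go. You can also simplify your if/else structure with a simple get, which returns a default value (e.g. np.nan) if the item is not found in the dictionary.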

import json
import numpy as np
import pandas as pd

with open('path/to/file') as f:
    # One list per output column; the keys must match the ones appended to below.
    d = {'groupe': [], 'id': [], 'MotherName': [], 'FatherName': []}
    for chunk in f:
        jfile = json.loads(chunk)
        d['groupe'].append(jfile['location'].get('groupe', np.nan))
        d['id'].append(jfile.get('id', np.nan))
        d['MotherName'].append(jfile['Mother'].get('MotherName', np.nan))
        d['FatherName'].append(jfile['Father'].get('FatherName', np.nan))

    df = pd.DataFrame(d)
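
As an alternative sketch of the same "build it once" idea, pandas also offers pd.json_normalize (a top-level function since pandas 1.0), which flattens nested dicts into dotted column names. This is a different technique from the one in the answer. The function name load_records, the lines argument, and the WANTED column mapping are assumptions made for illustration, and the sketch assumes every line is a JSON object. reindex is used so that a wanted column that appears in no record still shows up, filled with NaN.

```python
import json

import pandas as pd

# Flattened column name -> name wanted in the final DataFrame (illustrative).
WANTED = {
    'location.groupe': 'groupe',
    'id': 'id',
    'Mother.MotherName': 'MotherName',
    'Father.FatherName': 'FatherName',
}


def load_records(lines):
    """Build the DataFrame in one step from an iterable of JSON lines."""
    records = [json.loads(line) for line in lines if line.strip()]
    flat = pd.json_normalize(records)  # nested dicts become 'a.b' columns
    # reindex keeps only the wanted columns and adds missing ones as NaN.
    return flat.reindex(columns=list(WANTED)).rename(columns=WANTED)
```

An open file object can be passed directly as lines, for example load_records(f) inside a with open(...) block.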


2016-02-26 06:39:58 Alexander

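Your answer is nice, but I get the error "TypeError: list indices must be integers, not str" when converting the dictionary to a pandas DataFrame – Amanda

It sounds like there may be a problem with the data. Try creating a DataFrame from each column separately and see whether you can pin down the problem. – Alexander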
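
In the spirit of that suggestion, a related (not identical) check is to scan the lines and report where a field that the parsing code indexes like a dict is actually something else, which is what typically produces "list indices must be integers, not str". The names find_bad_lines and PARENT_KEYS are hypothetical, and the list of parent keys is an assumption based on the sample JSON in the question.

```python
import json

# Fields that the parsing code expects to be JSON objects (dicts).
PARENT_KEYS = ('location', 'Mother', 'Father')


def find_bad_lines(lines):
    """Return (line_number, key, type_name) for each unexpected value.

    `key` is None when the whole line is not a JSON object.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        record = json.loads(line)
        if not isinstance(record, dict):
            problems.append((number, None, type(record).__name__))
            continue
        for key in PARENT_KEYS:
            if key in record and not isinstance(record[key], dict):
                problems.append((number, key, type(record[key]).__name__))
    return problems
```

Running this over the open file before building the DataFrame gives the line numbers worth inspecting by hand.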
