Faster way to load JSON

I have website logs saved as JSON, and I want to load them into pandas. The JSON has this structure, with several levels of nested data:

{"settings":{"siteIdentifier":"site1"}, 
    "event":{"name":"pageview", 
      "properties":[]}, 
    "context":{"date":"Thu Dec 01 2016 01:00:08 GMT+0100 (CET)", 
       "location":{"hash":"", 
          "host":"aaa"}, 
       "screen":{"availHeight":876, 
         "orientation":{"angle":0, 
             "type":"landscape-primary"}}, 
       "navigator":{"appCodeName":"Mozilla", 
          "vendorSub":""}, 
       "visitor":{"id": "unique_id"}}, 
    "server":{"HTTP_COOKIE":"uid", 
       "date":"2016-12-01T00:00:09+00:00"}} 
{"settings":{"siteIdentifier":"site2"}, 
    "event":{"name":"pageview", 
      "properties":[]}, 
    "context":{"date":"Thu Dec 01 2016 01:00:10 GMT+0100 (CET)", 
       "location":{"hash":"", 
          "host":"aaa"}, 
       "screen":{"availHeight":852, 
         "orientation":{"angle":90, 
             "type":"landscape-primary"}}, 
       "navigator":{"appCodeName":"Mozilla", 
          "vendorSub":""}, 
       "visitor":{"id": "unique_id"}}, 
    "server":{"HTTP_COOKIE":"uid", 
       "date":"2016-12-01T00:00:09+00:10"}}

The only working solution I have right now is:

import pandas as pd
import json
from pandas.io.json import json_normalize
pd.set_option('expand_frame_repr', False)
pd.set_option('display.max_columns', 10)
pd.set_option("display.max_rows", 30)

first = True
filename = "/path/to/file.json"
with open(filename, 'r') as f:
    for line in f:  # read line by line to retrieve only one json
        data = json.loads(line)  # convert single json from string to json
        if first:  # initialize the dataframe
            df = json_normalize(data)
            first = False
        else:  # add a row for each json
            df = df.append(json_normalize(data))  # normalize to flatten the data
df.to_csv("2016-12-02.csv", index=False, encoding='utf-8')

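I read the file line by line because my JSON objects are simply pasted one after another rather than stored in a list. My code works, but it is very slow. What can I do to improve it? I'm using pandas because it seems appropriate, but if there is another way, that's fine too.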

2016-12-30 CoMartel

You can collect all the JSON objects into a single list first:

with open(filename, 'r') as f: 
    data = [json.loads(line) for line in f] 
    df = json_normalize(data) 
df.to_csv("2016-12-02.csv",index=False, encoding='utf-8')
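
For readers on a recent pandas, here is a minimal self-contained sketch of the same idea. It assumes pandas 1.0 or later, where json_normalize is available as pd.json_normalize; the two shortened records and the in-memory io.StringIO "file" are made up for illustration and stand in for the real log file. Building the frame once is likely faster because DataFrame.append copied the whole frame on every call (and was removed in pandas 2.0).

```python
import io
import json

import pandas as pd

# Hypothetical sample: two shortened log records, one JSON object per line.
sample = io.StringIO(
    '{"settings": {"siteIdentifier": "site1"}, "context": {"screen": {"availHeight": 876}}}\n'
    '{"settings": {"siteIdentifier": "site2"}, "context": {"screen": {"availHeight": 852}}}\n'
)

# Parse every line into a dict first (skipping blank lines),
# then flatten all records in a single json_normalize call.
records = [json.loads(line) for line in sample if line.strip()]
df = pd.json_normalize(records)  # assumes pandas >= 1.0

# Nested keys become dot-separated column names.
print(df.columns.tolist())
```

With a real file, replacing the StringIO object with open(filename, 'r') should be the only change needed.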

2016-12-30 10:53:32

I get an error on "df = json_normalize(data)": "TypeError: 'generator' object has no attribute '__getitem__'" – CoMartel

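@HarryPotfleur OK, I wasn't sure what would happen since I don't use pandas, so I wanted to try the slightly more efficient option first. I've edited it: it now uses square brackets, which makes it a list comprehension rather than a generator expression. –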

I just tested it, and it is amazingly faster! Thank you very much! – CoMartel
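
The generator-versus-list distinction from the comments above can be illustrated with plain Python. This is only a sketch with made-up sample lines: a generator expression is lazy and cannot be indexed, which is presumably why the first version of the answer failed, while a list comprehension builds a real list. On Python 3 the error message reads "'generator' object is not subscriptable" instead of the Python 2 wording quoted above.

```python
import json

# Hypothetical sample lines, one JSON object each.
lines = ['{"a": 1}', '{"a": 2}']

# Parentheses: generator expression (lazy, no indexing support).
gen = (json.loads(line) for line in lines)

# Square brackets: list comprehension (fully built, indexable).
lst = [json.loads(line) for line in lines]

try:
    gen[0]
    gen_indexable = True
except TypeError:
    # Generators do not implement __getitem__.
    gen_indexable = False

first_record = lst[0]
print(gen_indexable, first_record)
```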
