How can I optimize reading and processing a large file?

I have a script that caches the data returned from some poor soul's API as a flat file of JSON objects, one result (one JSON object) per line.

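The caching workflow is as follows:

Read in the entire cache file -> check whether each line is too old, line by line -> save the lines that are not too old into a new list -> write the new, fresh cache list to the file, and also use the new list as a filter so that the API is not called for incoming data that is already cached.

By far the longest part of this process is the one in bold above. Here is the code: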

import json
from dateutil import parser  # parser.parse is presumably dateutil's

# meta_dict, cache_timeout and fresh_cache are defined earlier in the script.
print("Reading cache file into memory ---")
with open('cache', 'r') as f:
    cache_lines = f.readlines()

print("Turning cache lines into json and checking if they are stale or not ---")
for line in cache_lines:
    # Load the line back up as a json object
    try:
        json_line = json.loads(line)
    except Exception as e:
        print(e)
        continue  # skip lines that fail to parse instead of reusing the previous json_line

    # Get the delta to determine if data is stale.
    delta = meta_dict["timestamp_start"] - parser.parse(json_line['timestamp_start'])

    # If the data is still fresh then hold onto it
    if cache_timeout >= delta:
        fresh_cache.append(json_line)

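Depending on the size of the file, this can take several minutes. Is there a faster way to do this? I understand that reading the whole file in is not ideal, but it was the easiest thing to implement.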

2015-12-21 Thisisstackoverflow

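Depending on the size of your file, this may cause memory problems. I don't know whether that is the kind of problem you are running into. The code above can be rewritten as follows: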

delta = meta_dict['timestamp_start']
with open('cache', 'r') as f:
    # Iterate over the file lazily, one line at a time, instead of loading it all into memory
    for line in f:
        json_line = json.loads(line)
        if delta - parser.parse(json_line['timestamp_start']) <= cache_timeout:
            fresh_cache.append(json_line)

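In addition:

If you use dateutil to parse the dates, each call can be expensive. If your format is known, you may want to use the standard conversion tools provided by datetime or dateutil instead.

If your file is really big and fresh_cache has to be really big as well, you can write the fresh entries to an intermediate file using another with statement.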
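The intermediate-file idea could look roughly like the sketch below. It is only an illustration: the function name filter_cache, its parameters, and the choice to skip malformed lines are assumptions made here, not part of the original script. The timestamp parser is passed in as a callable, so either dateutil's parser.parse or a fixed-format function can be used.

```python
import json


def filter_cache(cache_path, fresh_path, now, cache_timeout, parse_timestamp):
    """Copy the non-stale lines of cache_path to fresh_path; return how many were kept.

    All names here are hypothetical. `now` is a datetime, `cache_timeout` a
    timedelta, and `parse_timestamp` any callable turning a string into a datetime.
    """
    kept = 0
    # One `with` statement opens both files; each is closed when the block exits.
    with open(cache_path, 'r') as src, open(fresh_path, 'w') as dst:
        for line in src:  # reads one line at a time, never the whole file
            try:
                record = json.loads(line)
                started = parse_timestamp(record['timestamp_start'])
            except (ValueError, KeyError, TypeError):
                continue  # assumption: malformed lines are dropped silently
            if now - started <= cache_timeout:
                dst.write(line)  # write the original line back unchanged
                kept += 1
    return kept
```

With the names from the question, a call might be filter_cache('cache', 'cache.fresh', meta_dict['timestamp_start'], cache_timeout, parser.parse). Only one line is held in memory at a time, at the cost of having to re-read the intermediate file if the fresh entries are also needed as an in-memory filter.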

2015-12-21 21:40:34 ohe

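Thanks for the input. I was hoping there was some black magic, but it looks like I'm out of luck. I'll try not calling parser.parse on every line and see if that helps. – Thisisstackoverflow

You could also try the 'simplejson' library, which is faster than the standard 'json' library... – ohe

Good point. I'll give that a shot too. – Thisisstackoverflow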

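Reporting back: 1. simplejson had almost no effect. 2. Doing the datetime extraction manually had a big effect: the run time dropped from 8m11.578s to 2m55.681s. This replaced the parser.parse call above: datetime.datetime.strptime(json_line['timestamp_start'], '%Y-%m-%d %H:%M:%S.%f')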
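For reference, the fixed-format parsing could be wrapped in a small helper like the sketch below. The format string is the one quoted above. The helper name parse_timestamp and the fallback format are assumptions added here for illustration, since strptime raises ValueError when a timestamp has no fractional seconds.

```python
from datetime import datetime

# Format taken from the strptime call quoted above.
PRIMARY_FORMAT = '%Y-%m-%d %H:%M:%S.%f'
# Hypothetical fallback for timestamps without fractional seconds
# (an assumption, not something the thread mentions).
FALLBACK_FORMAT = '%Y-%m-%d %H:%M:%S'


def parse_timestamp(text):
    """Parse a timestamp with a known, fixed format instead of guessing the format."""
    try:
        return datetime.strptime(text, PRIMARY_FORMAT)
    except ValueError:
        # Raises ValueError again if the text matches neither format.
        return datetime.strptime(text, FALLBACK_FORMAT)
```

The speed-up reported above is plausibly because strptime matches one known layout, whereas parser.parse has to work out the layout of every string it is given.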

2015-12-21 23:31:07 Thisisstackoverflow
