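MemoryError in Python: How can I optimize my code?

I have a large number of JSON files to combine and output as a single CSV (to load into R); each JSON file is about 1.5 GB. When I try it on 4-5 JSON files of 250 MB each, I get the error below. I'm running Python version '2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]' on Windows 7 Professional 64-bit with 8 GB of RAM.

I'm a Python novice with little experience writing optimized code, and I would greatly appreciate guidance on how to optimize my script. Thanks!

======= Python MemoryError =======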

Traceback (most recent call last): 
    File "C:\Users\...\tweetjson_to_csv.py", line 52, in <module> 
    for line in file: 
MemoryError 
[Finished in 29.5s]

======= JSON to CSV conversion script =======

# csv file that you want to save to 
out = open("output.csv", "ab") 

filenames = ["8may.json", "9may.json", "10may.json", "11may.json", "12may.json"] 
open_files = map(open, filenames) 

# change argument to the file you want to open 
for file in open_files:
    for line in file:
        # only keep tweets and not the empty lines
        if line.rstrip():
            try:
                tweets.append(json.loads(line))
            except:
                pass

for tweet in tweets: 
    ids.append(tweet["id_str"]) 
    texts.append(tweet["text"]) 
    time_created.append(tweet["created_at"]) 
    retweet_counts.append(tweet["retweet_count"]) 
... ... 

print >> out, "ids,text,time_created,retweet_counts,in_reply_to,geos,coordinates,places,country,language,screen_name,followers,friends,statuses,locations" 
rows = zip(ids,texts,time_created,retweet_counts,in_reply_to_screen_name,geos,coordinates,places,places_country,lang,user_screen_names,user_followers_count,user_friends_count,user_statuses_count,user_locations) 

csv = writer(out) 

for row in rows: 
    values = [(value.encode('utf8') if hasattr(value, 'encode') else value) for value in row] 
    csv.writerow(values) 

out.close()

Source

2014-05-15 Eugene Yan

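You're loading everything into memory ('tweets.append(json.loads(line))'). Can you restructure your algorithm so that it writes to 'output.csv' immediately after reading each line? – univerio

This might be better suited to http://codereview.stackexchange.com – dano

While I'm here, though: you should open the files one at a time. There's no reason to open them all at once, especially since you never close them when you're done with them. – dano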

This line right here:

open_files = map(open, filenames)

opens every file at the same time.
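For comparison, here is a minimal sketch of reading the files lazily instead. It is written for Python 3, and the name `iter_nonempty_lines` is hypothetical rather than anything from the question. Because it is a generator, only one file is open and only one line is held in memory at any moment.

```python
def iter_nonempty_lines(filenames):
    """Yield the non-blank lines of each file, opening one file at a time."""
    for name in filenames:
        # "with" closes the file once its lines are exhausted,
        # before the next file is opened
        with open(name, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield line
```

A loop such as `for line in iter_nonempty_lines(filenames):` would then see every tweet line in order without any list of open files being built.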

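Then you read everything and put it all into the same single list, tweets.

You also have two main for loops, so each tweet (and there are several GB of them) is passed over not twice but a staggering four times: once when it is read, once in the second loop, once more inside zip (which builds a complete list in Python 2), and finally when the rows are written to the file. Any one of these points could be the cause of the memory error.

Unless absolutely necessary, try to touch each piece of data only once. As you iterate over a file, process the line and write it out immediately.

Try something like this instead: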

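filenames = ["8may.json", "9may.json", "10may.json", "11may.json", "12may.json"]

def process_tweet_into_line(line):
    # placeholder: load the line as json, build one csv line from it and return that
    return line

# csv file that you want to save to
out = open("output.csv", "ab")

# open one file at a time; "with" closes each file when its block ends
for name in filenames:
    with open(name) as file:
        for line in file:
            # only keep tweets and not the empty lines
            if line.rstrip():
                try:
                    tweet = process_tweet_into_line(line)
                    # write the processed line straight away instead of storing it
                    out.write(tweet)
                except (ValueError, KeyError):
                    # skip lines that are not valid JSON or lack an expected field
                    pass

out.close()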
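To make the placeholder concrete, here is a minimal sketch of the same streaming idea with the JSON parsing and CSV writing filled in. It is an illustration under assumptions, not the asker's final script: it targets Python 3, the names `FIELDS`, `tweet_to_row` and `convert` are hypothetical, only four of the question's columns are included, and the output file is overwritten rather than appended to.

```python
import csv
import json

# Assumed subset of the columns used in the question; extend as needed.
FIELDS = ["id_str", "text", "created_at", "retweet_count"]

def tweet_to_row(line):
    """Parse one JSON line and return the selected fields as a list."""
    tweet = json.loads(line)  # raises ValueError on malformed JSON
    if not isinstance(tweet, dict):
        raise ValueError("line is not a JSON object")
    # .get() returns None for a missing key, which csv writes as an empty cell
    return [tweet.get(field) for field in FIELDS]

def convert(filenames, out_path):
    """Stream tweets from JSON-lines files into one CSV, one row at a time."""
    written = 0
    # newline="" lets the csv module control line endings (Python 3)
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(FIELDS)
        for name in filenames:
            with open(name, encoding="utf-8") as source:
                for line in source:
                    if not line.strip():
                        continue  # skip blank lines
                    try:
                        row = tweet_to_row(line)
                    except ValueError:
                        continue  # skip lines that are not usable JSON
                    writer.writerow(row)
                    written += 1
    return written
```

Calling `convert(filenames, "output.csv")` holds only one parsed tweet in memory at a time, regardless of how large the input files are. Under Python 2.7, as used in the question, the output file would instead be opened in binary mode and unicode values encoded to UTF-8 before writing, as the question's own script already does.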

Source

2014-05-15 02:51:00

Thanks @Lego Stormtroopr, but I'm having some difficulty getting the code to run. Could you help by elaborating a bit? Thanks! –
