I have a large number of JSON files (each roughly 1.5 GB) that I want to combine and output as a single CSV (to load into R). When testing on 4-5 JSON files of about 250 MB each, I get the error shown below. I'm running Python '2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]' on Windows 7 Professional 64-bit with 8 GB of RAM.
MemoryError in Python: how can I optimize my code?
I'm a Python novice with little experience writing optimized code, and I would greatly appreciate advice on how to optimize my script. Thanks!
======= Python MemoryError =======
Traceback (most recent call last):
  File "C:\Users\...\tweetjson_to_csv.py", line 52, in <module>
    for line in file:
MemoryError
[Finished in 29.5s]
======= JSON-to-CSV conversion script =======
import json
from csv import writer

# csv file that you want to save to
out = open("output.csv", "ab")

filenames = ["8may.json", "9may.json", "10may.json", "11may.json", "12may.json"]
open_files = map(open, filenames)

# change argument to the file you want to open
for file in open_files:
    for line in file:
        # only keep tweets and not the empty lines
        if line.rstrip():
            try:
                tweets.append(json.loads(line))
            except:
                pass

for tweet in tweets:
    ids.append(tweet["id_str"])
    texts.append(tweet["text"])
    time_created.append(tweet["created_at"])
    retweet_counts.append(tweet["retweet_count"])
    ... ...

print >> out, "ids,text,time_created,retweet_counts,in_reply_to,geos,coordinates,places,country,language,screen_name,followers,friends,statuses,locations"
rows = zip(ids,texts,time_created,retweet_counts,in_reply_to_screen_name,geos,coordinates,places,places_country,lang,user_screen_names,user_followers_count,user_friends_count,user_statuses_count,user_locations)
csv = writer(out)
for row in rows:
    values = [(value.encode('utf8') if hasattr(value, 'encode') else value) for value in row]
    csv.writerow(values)
out.close()
You are loading everything into memory (`tweets.append(json.loads(line))`). Can you restructure your algorithm so that you write to `output.csv` immediately after reading each line? – univerio
This might be a better fit for http://codereview.stackexchange.com – dano
But while I'm here: you should open the files one at a time. There's no reason to open them all at once, especially since you never close them when you're done with them. – dano
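The two comments above can be combined into a streaming rewrite: open one input file at a time and write each tweet's row as soon as its line is parsed, so nothing accumulates in memory. A minimal sketch (in Python 3 syntax; the function name and the trimmed field list are illustrative, not from the question):

```python
import csv
import json

# subset of the fields the question extracts; extend as needed
FIELDS = ["id_str", "text", "created_at", "retweet_count"]

def tweets_to_csv(filenames, out_path):
    """Stream tweets from JSON-lines files into one CSV, row by row."""
    with open(out_path, "w", newline="") as out:
        w = csv.writer(out)
        w.writerow(FIELDS)  # header row
        for name in filenames:
            # open one input file at a time; `with` closes it when done
            with open(name) as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue  # skip the empty lines
                    try:
                        tweet = json.loads(line)
                    except ValueError:
                        continue  # skip malformed lines
                    # write immediately instead of appending to lists
                    w.writerow([tweet.get(field, "") for field in FIELDS])
```

In Python 2 the per-value `.encode('utf8')` step from the question would still be needed before `writerow`; in Python 3 the `csv` module handles unicode text directly.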