Pandas takes too long and consumes too much memory when working with an Excel file. I am working with an Excel sheet of fewer than 50k rows. What I want to do is: from a specific column, get all the unique values, and then, for each unique value, get all the rows that contain it and put them into this format:
[{
"unique_field_value": [Array containing row data that match the unique value as dictionaries]
},]
The thing is, when I test with as few as 1000 rows, everything goes fine. As the row count grows, memory usage grows too, until it can't grow any more and my computer freezes. So, is there something I'm doing that pandas isn't suited for? Here are my platform details:
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.3 LTS"
NAME="Ubuntu"
VERSION="16.04.3 LTS (Xenial Xerus)"
ID_LIKE=debian
VERSION_ID="16.04"
And here is the code I am running in a Jupyter notebook:
import pandas as pd
import simplejson
import datetime

def datetime_handler(x):
    if isinstance(x, datetime.datetime):
        return x.isoformat()
    raise TypeError("Type not Known")

path = "/home/misachi/Downloads/new members/my_file.xls"
df = pd.read_excel(path, index_col=None, skiprows=[0])
df = df.dropna(thresh=5)
df2 = df.drop_duplicates(subset=['corporate'])
schemes = df2['corporate'].values

result_list = []
result_dict = {}
for count, name in enumerate(schemes):
    inner_dict = {}
    col_val = schemes[count]
    foo = df['corporate'] == col_val
    data = df[foo].to_json(orient='records', date_format='iso')
    result_dict[name] = simplejson.loads(data)
    result_list.append(result_dict)
    # print(result_list)
    # if count == 3:
    #     break

dumped = simplejson.dumps(result_list, ignore_nan=True, default=datetime_handler)

with open('/home/misachi/Downloads/new members/members/folder/insurance.json', 'w') as json_f:
    json_f.write(dumped)
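A likely contributor to the runaway memory and output size in the loop above: `result_list.append(result_dict)` appends the same dictionary object on every iteration, so the final list holds N references to one ever-growing dict, and serializing it writes every group out N times. A minimal sketch of the effect (the names and values are stand-ins, not the real data):

```python
result_list = []
result_dict = {}

for name in ["a", "b", "c"]:  # stand-in for the unique 'corporate' values
    result_dict[name] = [1, 2, 3]  # stand-in for one group's row dicts
    result_list.append(result_dict)  # same dict object appended every time

# Every list entry is the *same* dict, which by now holds ALL keys,
# so serializing result_list repeats every group N times.
assert all(entry is result_dict for entry in result_list)
assert all(set(entry) == {"a", "b", "c"} for entry in result_list)
```

Appending a fresh single-key dict per iteration (e.g. `result_list.append({name: ...})`) avoids the duplication; the groupby rewrite later in the post sidesteps it entirely.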
EDIT
Below is a sample of the expected output:
[{
"TABBY MEMORIAL CATHEDRAL": [{
"corp_id": 8494,
"smart": null,
"copay": null,
"corporate": "TABBY MEMORIAL CATHEDRAL",
"category": "CAT A",
"member_names": "Brian Maombi",
"member_no": "84984",
"start_date": "2017-03-01T00:00:00.000Z",
"end_date": "2018-02-28T00:00:00.000Z",
"outpatient": "OUTPATIENT"
}, {
"corp_id": 8494,
"smart": null,
"copay": null,
"corporate": "TABBY MEMORIAL CATHEDRAL",
"category": "CAT A",
"member_names": "Omula Peter",
"member_no": "4784984",
"start_date": "2017-03-01T00:00:00.000Z",
"end_date": "2018-02-28T00:00:00.000Z",
"outpatient": "OUTPATIENT"
}],
"CHECKIFY KENYA LTD": [{
"corp_id": 7489,
"smart": "SMART",
"copay": null,
"corporate": "CHECKIFY KENYA LTD",
"category": "CAT A",
"member_names": "BENARD KONYI",
"member_no": "ABB/8439",
"start_date": "2017-08-01T00:00:00.000Z",
"end_date": "2018-07-31T00:00:00.000Z",
"outpatient": "OUTPATIENT"
}, {
"corp_id": 7489,
"smart": "SMART",
"copay": null,
"corporate": "CHECKIFY KENYA LTD",
"category": "CAT A",
"member_names": "KEVIN WACHAI",
"member_no": "ABB/67484",
"start_date": "2017-08-01T00:00:00.000Z",
"end_date": "2018-07-31T00:00:00.000Z",
"outpatient": "OUTPATIENT"
}]
}]
The full, cleaned-up code is:
import os
import pandas as pd
import simplejson
import datetime

def datetime_handler(x):
    if isinstance(x, datetime.datetime):
        return x.isoformat()
    raise TypeError("Unknown type")

def work_on_data(filename):
    if not os.path.isfile(filename):
        raise IOError
    df = pd.read_excel(filename, index_col=None, skiprows=[0])
    df = df.dropna(thresh=5)
    result_list = [{n: g.to_dict('records')} for n, g in df.groupby('corporate')]
    dumped = simplejson.dumps(result_list, ignore_nan=True, default=datetime_handler)
    return dumped

dumped = work_on_data('/home/misachi/Downloads/new members/my_file.xls')
with open('/home/misachi/Downloads/new members/members/folder/insurance.json', 'w') as json_f:
    json_f.write(dumped)
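As a quick sanity check that the groupby-based version produces the desired [{unique_value: [row dicts]}] shape, it can be run on a tiny in-memory frame (column names here mirror the post; the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "corporate": ["A", "A", "B"],
    "member_names": ["x", "y", "z"],
})

# One {group_name: list-of-row-dicts} entry per unique 'corporate' value
result_list = [{n: g.to_dict('records')} for n, g in df.groupby('corporate')]

print(result_list)
# [{'A': [{'corporate': 'A', 'member_names': 'x'},
#         {'corporate': 'A', 'member_names': 'y'}]},
#  {'B': [{'corporate': 'B', 'member_names': 'z'}]}]
```

Each group's rows become a list of dictionaries keyed by the unique 'corporate' value, which is exactly the format asked for in the question.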
This runs faster, uses less memory, and is even cleaner. But it doesn't return the data in the specified format, i.e. [{key: val}], where key is the unique field value and val is a list of dictionaries holding the data for all rows sharing that unique field value. – Misachi
I suggest you put together an actual example and show the desired output, so I don't have to guess what you're trying to do. – piRSquared
I've added a sample of the output in the edit above. – Misachi