2017-09-06 38 views
1

I am trying to work with an Excel sheet of under 50k rows. What I want to do is: using a particular column, get all of its unique values, and then, for each unique value, gather every row that contains it and put them in the format below. The trouble is that pandas takes too long and consumes too much memory while working with the Excel file.

[{ 
"unique_field_value": [Array containing row data that match the unique value as dictionaries] 
},] 

The thing is, everything goes fine when I test with as few as 1000 rows. As the row count grows, memory usage grows with it, until it cannot grow any further and my machine freezes. So, is there something here that pandas is not suited for? Here are my platform details:

DISTRIB_ID=Ubuntu 
DISTRIB_RELEASE=16.04 
DISTRIB_CODENAME=xenial 
DISTRIB_DESCRIPTION="Ubuntu 16.04.3 LTS" 
NAME="Ubuntu" 
VERSION="16.04.3 LTS (Xenial Xerus)" 
ID_LIKE=debian 
VERSION_ID="16.04" 

Here is the code I am running in a Jupyter notebook:

import pandas as pd 
import simplejson 
import datetime 

def datetime_handler(x): 
    if isinstance(x, datetime.datetime): 
        return x.isoformat() 
    raise TypeError("Type not Known") 

path = "/home/misachi/Downloads/new members/my_file.xls" 
df = pd.read_excel(path, index_col=None, skiprows=[0]) 
df = df.dropna(thresh=5) 
df2 = df.drop_duplicates(subset=['corporate']) 

schemes = df2['corporate'].values 

result_list = [] 
result_dict = {} 

for count, name in enumerate(schemes): 
    inner_dict = {} 
    col_val = schemes[count] 
    foo = df['corporate'] == col_val 
    data = df[foo].to_json(orient='records', date_format='iso') 
    result_dict[name] = simplejson.loads(data) 
    result_list.append(result_dict) 
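    # NB: result_dict is created once, outside the loop, so each append stores
    # a reference to the same ever-growing dict; serializing result_list then
    # repeats the whole mapping once per unique value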
#  print(result_list) 
#  if count == 3: 
#   break 

dumped = simplejson.dumps(result_list, ignore_nan=True, default=datetime_handler) 

with open('/home/misachi/Downloads/new members/members/folder/insurance.json', 'w') as json_f: 
    json_f.write(dumped) 

EDIT

Here is a sample of the expected output:

[{ 
    "TABBY MEMORIAL CATHEDRAL": [{ 
     "corp_id": 8494, 
     "smart": null, 
     "copay": null, 
     "corporate": "TABBY MEMORIAL CATHEDRAL", 
     "category": "CAT A", 
     "member_names": "Brian Maombi", 
     "member_no": "84984", 
     "start_date": "2017-03-01T00:00:00.000Z", 
     "end_date": "2018-02-28T00:00:00.000Z", 
     "outpatient": "OUTPATIENT" 
    }, { 
     "corp_id": 8494, 
     "smart": null, 
     "copay": null, 
     "corporate": "TABBY MEMORIAL CATHEDRAL", 
     "category": "CAT A", 
     "member_names": "Omula Peter", 
     "member_no": "4784984", 
     "start_date": "2017-03-01T00:00:00.000Z", 
     "end_date": "2018-02-28T00:00:00.000Z", 
     "outpatient": "OUTPATIENT" 
    }], 
    "CHECKIFY KENYA LTD": [{ 
     "corp_id": 7489, 
     "smart": "SMART", 
     "copay": null, 
     "corporate": "CHECKIFY KENYA LTD", 
     "category": "CAT A", 
     "member_names": "BENARD KONYI", 
     "member_no": "ABB/8439", 
     "start_date": "2017-08-01T00:00:00.000Z", 
     "end_date": "2018-07-31T00:00:00.000Z", 
     "outpatient": "OUTPATIENT" 
    }, { 
     "corp_id": 7489, 
     "smart": "SMART", 
     "copay": null, 
     "corporate": "CHECKIFY KENYA LTD", 
     "category": "CAT A", 
     "member_names": "KEVIN WACHAI", 
     "member_no": "ABB/67484", 
     "start_date": "2017-08-01T00:00:00.000Z", 
     "end_date": "2018-07-31T00:00:00.000Z", 
     "outpatient": "OUTPATIENT" 
    }] 
}] 

The complete, cleaned-up code is:

import os 
import pandas as pd 
import simplejson 
import datetime 


def datetime_handler(x): 
    if isinstance(x, datetime.datetime): 
        return x.isoformat() 
    raise TypeError("Unknown type") 


def work_on_data(filename): 
    if not os.path.isfile(filename): 
        raise IOError 
    df = pd.read_excel(filename, index_col=None, skiprows=[0]) 
    df = df.dropna(thresh=5) 

    # Build one {corporate: rows} dict per unique value in a single groupby pass
    result_list = [{n: g.to_dict('records')} for n, g in df.groupby('corporate')] 

    dumped = simplejson.dumps(result_list, ignore_nan=True, default=datetime_handler) 
    return dumped 

dumped = work_on_data('/home/misachi/Downloads/new members/my_file.xls') 
with open('/home/misachi/Downloads/new members/members/folder/insurance.json', 'w') as json_f: 
    json_f.write(dumped) 

Answers

1

Get your dictionary with:

result_dict = [{n: g.to_dict('records') for n, g in df.groupby('corporate')}] 
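For instance, here is a minimal sketch of what that one-liner produces, using a made-up two-column frame in place of the spreadsheet:

import pandas as pd 

# Hypothetical stand-in for the spreadsheet: two corporates, three rows 
df = pd.DataFrame({ 
    'corporate': ['ACME LTD', 'ACME LTD', 'BETA CO'], 
    'member_names': ['Jane', 'John', 'Ann'], 
}) 

# Single pass over the data: one dict keyed by corporate name, wrapped in a list 
result_dict = [{n: g.to_dict('records') for n, g in df.groupby('corporate')}] 
# [{'ACME LTD': [{'corporate': 'ACME LTD', 'member_names': 'Jane'}, 
#                {'corporate': 'ACME LTD', 'member_names': 'John'}], 
#   'BETA CO': [{'corporate': 'BETA CO', 'member_names': 'Ann'}]}] 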
+0

This works faster, is more efficient, and is even cleaner. But it does not return the data in the specified format, i.e. [{key: val}], where key is the unique field name and val is a list of dictionaries holding the data of all rows that share that unique field value. – Misachi
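(For reference, the difference the comment describes comes down to bracket placement; a quick sketch, reusing the hypothetical df above:

# One dict holding every key, inside a one-element list (the answer's form) 
[{n: g.to_dict('records') for n, g in df.groupby('corporate')}] 

# One single-key dict per unique value (the format the question asks for) 
[{n: g.to_dict('records')} for n, g in df.groupby('corporate')] 
# [{'ACME LTD': [...]}, {'BETA CO': [...]}] 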

+0

I suggest you put together an actual example and show the desired output, so that I don't have to guess at what you are trying to do. – piRSquared

+0

I've added a sample output in the edit above – Misachi

0

Specify the chunksize=10000 parameter with read_excel() and loop through the file until you reach the end of the data. This will help you manage memory when working with large files. If you have multiple sheets to manage, follow this example

for chunk in pd.read_excel(path, index_col=None, skiprows=[0], chunksize=10000): 
    df = chunk.dropna(thresh=5) 
    df2 = df.drop_duplicates(subset=['corporate']) 
    # rest of your code 
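(Note: as the follow-up comment points out, pandas implements chunksize for pd.read_csv but not for pd.read_excel, so the loop above will not run as written. A sketch of a chunked variant that does work, assuming the sheet is first exported to CSV; the path here is hypothetical:

import pandas as pd 

csv_path = '/home/misachi/Downloads/new members/my_file.csv'  # hypothetical CSV export of the sheet 

# read_csv with chunksize returns an iterator of 10k-row DataFrames 
for chunk in pd.read_csv(csv_path, index_col=None, skiprows=[0], chunksize=10000): 
    df = chunk.dropna(thresh=5) 
    # process each chunk independently 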
+0

Specifying chunksize raises a NotImplementedError. The docs don't cover it either. – Misachi