2016-04-20 83 views

Answers

1
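def split_dataframe(df, size):
    """Split df row-wise into segments of roughly `size` bytes or less."""

    # average size of each row in bytes (memory_usage includes the index)
    row_size = df.memory_usage().sum() / len(df)

    # maximum number of rows of each segment (an int for iloc, at least 1)
    row_limit = max(1, int(size // row_size))

    # number of segments (ceiling division)
    seg_num = (len(df) + row_limit - 1) // row_limit

    # split df
    segments = [df.iloc[i*row_limit : (i+1)*row_limit] for i in range(seg_num)]

    return segments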

Your solution is general. Accepted! – Segmented

0

This is simplest when the DataFrame's columns have fixed-size dtypes (i.e., not object). Here is an example of how you could do it.

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1]*100, 'b': [1.1, 2] * 50, 'c': range(100)})

# calculate the number of bytes a row occupies
row_bytes = df.dtypes.apply(lambda x: x.itemsize).sum()

mem_limit = 1024

# get the maximum number of rows in a segment (a whole number, at least 1)
max_rows = max(1, mem_limit // row_bytes)

# get the number of dataframes after splitting (array_split needs an int)
n_dfs = int(np.ceil(df.shape[0] / max_rows))

# get the indices of the dataframe segments
df_segments = np.array_split(df.index, n_dfs)

# create a list of dataframes that are below mem_limit
split_dfs = [df.loc[seg, :] for seg in df_segments]

split_dfs
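For object columns, `itemsize` only counts the 8-byte pointer per cell, not the Python objects behind it. One possible variation, which is my assumption rather than part of the answer, is to estimate the average row size with `memory_usage(deep=True)` instead. The column names and sample values below are made up for illustration.

```python
import pandas as pd

# Hypothetical frame with one numeric and one object (string) column
df = pd.DataFrame({
    'n': pd.Series(range(100), dtype='int64'),
    's': pd.Series(['some text'] * 100, dtype=object),
})

# Shallow usage counts only the pointers stored in the object column
shallow = df.memory_usage(index=False).sum()

# Deep usage also inspects the Python objects the pointers refer to
deep = df.memory_usage(index=False, deep=True).sum()

# Average bytes per row, then the row cap for a given byte budget
avg_row_bytes = deep / len(df)
mem_limit = 1024
max_rows = max(1, int(mem_limit // avg_row_bytes))

print(shallow, deep, max_rows)
```

Because this is an average, individual segments can still exceed the limit when string lengths vary a lot between rows.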

Also, if you can split by columns instead of rows, pandas has a handy method for this: memory_usage.
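A minimal sketch of a column-wise split, assuming a simple greedy grouping is acceptable. `split_columns_by_memory` is a hypothetical helper name, not a pandas function, and it assumes unique column names.

```python
import numpy as np
import pandas as pd


def split_columns_by_memory(df, mem_limit):
    """Greedily group columns so each group stays within mem_limit bytes.

    Hypothetical helper for illustration; not part of pandas.
    A single column larger than mem_limit still gets its own group.
    """
    # Per-column memory in bytes; index=False leaves the index out
    col_bytes = df.memory_usage(index=False, deep=True)

    groups, current, current_bytes = [], [], 0
    for col, nbytes in col_bytes.items():
        # Start a new group when adding this column would exceed the limit
        if current and current_bytes + nbytes > mem_limit:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(col)
        current_bytes += nbytes
    if current:
        groups.append(current)

    return [df[cols] for cols in groups]


# Three 8-byte columns of 100 rows each, i.e. 800 bytes per column
df = pd.DataFrame({
    'a': np.ones(100, dtype='int64'),
    'b': np.tile([1.1, 2.0], 50),
    'c': np.arange(100, dtype='int64'),
})
parts = split_columns_by_memory(df, mem_limit=1600)
print([list(p.columns) for p in parts])
```

The grouping keeps columns in their original order, so it is not guaranteed to produce the fewest possible groups.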