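I have a fairly large dataset of shape (2678271, 52) with a 5-dimensional index, which takes up 6.5% of the machine's memory. When I call df.sortlevel(k), pandas (pandas.pydata.org) throws a memory error: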
df.sortlevel(k)
The error I get is:
MemoryError Traceback (most recent call last)
in()
----> 1 df = df.sortlevel(4)
/usr/local/lib/python2.7/dist-packages/pandas-0.9.1-py2.7-linux-x86_64.egg/pandas/core/frame.pyc in sortlevel(self, level, axis, ascending)
2978 raise Exception('can only sort by level with a hierarchical index')
2979
-> 2980 new_axis, indexer = the_axis.sortlevel(level, ascending=ascending)
2981
2982 if self._data.is_mixed_dtype():
/usr/local/lib/python2.7/dist-packages/pandas-0.9.1-py2.7-linux-x86_64.egg/pandas/core/index.pyc in sortlevel(self, level, ascending)
1856 indexer = _indexer_from_factorized((primary,) + tuple(labels),
1857 (primshp,) + tuple(shape),
-> 1858 compress=False)
1859 if not ascending:
1860 indexer = indexer[::-1]
/usr/local/lib/python2.7/dist-packages/pandas-0.9.1-py2.7-linux-x86_64.egg/pandas/core/groupby.pyc in _indexer_from_factorized(labels, shape, compress)
2124 max_group = np.prod(shape)
2125
-> 2126 indexer, _ = lib.groupsort_indexer(comp_ids.astype(np.int64), max_group)
2127
2128 return indexer
/usr/local/lib/python2.7/dist-packages/pandas-0.9.1-py2.7-linux-x86_64.egg/pandas/lib.so in pandas.lib.groupsort_indexer (pandas/src/tseries.c:55052)()
MemoryError:
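Are there any hard-coded conditions that trigger this error? Or is it possible that the operation consumes all of the remaining memory, even though the data itself uses only 6.5% of memory (according to htop)?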
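One way to reason about the second possibility: the traceback shows `max_group = np.prod(shape)` being passed to `lib.groupsort_indexer` with `compress=False`. Assuming that routine sizes a working buffer by `max_group` (an assumption about pandas 0.9.1 internals that the traceback suggests but does not confirm), the memory needed would scale with the product of the index level sizes rather than with the number of rows. The sketch below estimates that size. The helper name `estimate_group_space_bytes` and the level sizes are made up for illustration and are not taken from the dataset above.

```python
def estimate_group_space_bytes(level_sizes, itemsize=8):
    """Hypothetical helper: bytes for one int64 slot per possible key combination."""
    # Multiply as Python ints so the product cannot overflow like a fixed-width integer
    n_groups = 1
    for size in level_sizes:
        n_groups *= int(size)
    return n_groups * itemsize


# Made-up sizes for the five index levels (assumption, not from the question)
level_sizes = [1000, 500, 50, 20, 12]
n_rows = 2678271  # row count from the question

group_space_bytes = estimate_group_space_bytes(level_sizes)
per_row_bytes = n_rows * 8  # one int64 per row, for comparison

print(group_space_bytes / 1e9, "GB if sized by the product of level sizes")
print(per_row_bytes / 1e6, "MB if sized by the number of rows")
```

For a real frame, the level sizes could be read with `[len(level) for level in df.index.levels]`. If the product is far larger than the row count, an allocation sized by it could plausibly exhaust memory even when the data itself is small.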
There were a lot of performance enhancements in 0.10. Are you able to try the latest version of pandas? http://pandas.pydata.org/pandas-docs/stable/whatsnew.html – Zelazny7
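There are still a few things in 0.10 that make it hard for me to switch, so I will have to wait for 0.10.1. But is there a specific change related to this issue that would explain the behavior? –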
An `inplace` option was added to `sortlevel`, which may reduce memory usage: https://github.com/pydata/pandas/issues/1873 – Zelazny7
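A minimal sketch of an in-place sort by index level, for readers on a recent pandas release. It assumes a version in which `DataFrame.sortlevel` is no longer available and `sort_index(level=...)` is used instead. The tiny frame is a made-up stand-in for the real data, and whether `inplace=True` actually lowers peak memory is not something this example demonstrates.

```python
import numpy as np
import pandas as pd

# Small made-up frame with a 2-level index (stand-in for the real 5-level one)
index = pd.MultiIndex.from_product([["b", "a"], [2, 1]], names=["outer", "inner"])
df = pd.DataFrame({"value": np.arange(4)}, index=index)

# Sort by one index level, modifying df rather than returning a new frame
df.sort_index(level="inner", inplace=True)

print(df)
```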