2014-01-24 64 views

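pandas: accessing rows of a SparseDataFrame

I have a large SparseDataFrame (say 20k index x 10k columns) with very low density (about 0.1% of the entries are set). I am trying to access a specific row of the data frame, but I can't seem to do it. Accessing columns works fine, though. Here is a small example that illustrates the problem: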

import numpy as np 
import pandas as pd 

df = pd.DataFrame(np.arange(15).reshape(5,3), index=list('abcde')) 
df.loc['b',1] = np.nan # for good measure... 
sparse = df.to_sparse() 

sparse[1] # This is OK. 
df.loc['b'] # This is also OK. 
sparse.loc['b'] # This blows up. 

Here is the traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/.../.virtualenvs/exp/lib/python2.7/site-packages/pandas/core/indexing.py", line 1020, in __getitem__
    return self._getitem_axis(key, axis=0)
  File "/Users/.../.virtualenvs/exp/lib/python2.7/site-packages/pandas/core/indexing.py", line 1145, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "/Users/.../.virtualenvs/exp/lib/python2.7/site-packages/pandas/core/indexing.py", line 68, in _get_label
    return self.obj._xs(label, axis=axis, copy=True)
  File "/Users/.../.virtualenvs/exp/lib/python2.7/site-packages/pandas/core/frame.py", line 2149, in xs
    new_values, copy = self._data.fast_2d_xs(loc, copy=copy)
  File "/Users/.../.virtualenvs/exp/lib/python2.7/site-packages/pandas/core/internals.py", line 2714, in fast_2d_xs
    result[i] = blk._try_coerce_result(blk.iget((j, loc)))
  File "/Users/.../.virtualenvs/exp/lib/python2.7/site-packages/pandas/core/internals.py", line 275, in iget
    return self.values[i]
  File "/Users/.../.virtualenvs/exp/lib/python2.7/site-packages/pandas/sparse/array.py", line 286, in __getitem__
    data_slice = self.values[key]
IndexError: too many indices

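Note that the same operation works fine on a "normal", dense DataFrame object. Because of the size of my data, however, this is a major inconvenience, since I would have to either: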

  1. Transpose the data frame (takes ages)
  2. Convert it to a dense data frame (eats too much memory)

I am relatively new to pandas, so maybe I am missing something. In any case, any help is appreciated!

I don't use sparse DataFrames or know enough about their limitations, but this works: `sparse.loc['b':'b']`, as does `sparse.ix['b':'b']`. I still don't know why it fails without a slice. – EdChum

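@EdChum Interesting observation. The difference I see is that slicing returns a DataFrame rather than a Series, so perhaps the problem lies somewhere in that conversion. – lum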

This may simply not be implemented: https://groups.google.com/forum/#!topic/pydata/YEdD8UrkV28, and in fact there is already a request for it: https://github.com/pydata/pandas/issues/4400 – EdChum
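
The slice-based workaround from the comments can be sketched as follows. This is only an illustration under stated assumptions: it targets newer pandas releases (1.0 and later), where `to_sparse()` and `SparseDataFrame` no longer exist, so the sparse frame is built with `pd.SparseDtype` instead of the API used in the question. The names `row_frame` and `row` are made up for the example.

```python
import numpy as np
import pandas as pd

# Same toy data as in the question, created as float so NaN fits without upcasting.
df = pd.DataFrame(np.arange(15, dtype=float).reshape(5, 3), index=list('abcde'))
df.loc['b', 1] = np.nan

# Assumption: newer pandas, where sparse columns replace SparseDataFrame.
sparse = df.astype(pd.SparseDtype("float", np.nan))

# A label slice with identical start and stop returns a one-row DataFrame
# instead of a Series, which is the path the comments report as working.
row_frame = sparse.loc['b':'b']

# Densify only that single row (cheap), then take it as a Series.
row = row_frame.sparse.to_dense().iloc[0]
print(row)
```

Only the selected row is converted to dense form here, so the memory cost is proportional to the number of columns rather than to the whole frame.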

Answers