2013-02-07 43 views
0

Using a MultiIndex on a DataFrame

This is a follow-up to the answer to this question:

pandas performance issue - need help to optimize

The suggestion there, to use a MultiIndex, works as follows:

import numpy as np
from pandas import DataFrame
df = DataFrame(np.arange(20).reshape(5,4))
df2 = df.set_index(keys=[0,1,2])
df2.ix[(4,5,6)]

So I created a file, sample_data.csv, that looks like this:

col1,col2,year,amount 
111111,3.5,2012,700 
111112,3.5,2011,600 
222221,4.0,2012,222 
... 

I then ran the following:

import numpy as np 
import pandas as pd 
sd=pd.read_csv('sample_data.csv') 
sd2=sd.set_index(keys=['col2','year']) 
sd2.ix[(4.0,2012)] 

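However, this produces the following error: IndexError: index out of bounds

Any idea why it works in the former case but not in the latter? This is what the error looks like: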


IndexError        Traceback (most recent call last) 
<ipython-input-19-1d72a961db95> in <module>() 
----> 1 sd2.ix[(4.0,2012)] 

/Library/Python/2.7/site-packages/pandas-0.8.1-py2.7-macosx-10.7-intel.egg/pandas/core/indexing.pyc in __getitem__(self, key) 
    31     pass 
    32 
---> 33    return self._getitem_tuple(key) 
    34   else: 
    35    return self._getitem_axis(key, axis=0) 
+1

Your code works for me. Which version of pandas are you using? – joris

+0

It also works for me (in pandas 0.10.0). Also, you can skip the set_index step if you use: pd.read_csv('sample_data.csv', index_col=['col2','year']) –

+0

pandas-0.8.1. Is that why it's failing? – femibyte

Answers

1

To show that it works for me (pandas 0.10.1):

In [1]: from StringIO import StringIO 
In [2]: import numpy as np 
In [3]: import pandas as pd 
In [4]: s = StringIO("""col1,col2,year,amount 
    ...: 111111,3.5,2012,700 
    ...: 111112,3.5,2011,600 
    ...: 222221,4.0,2012,222""") 

In [5]: sd=pd.read_csv(s) 
In [6]: sd2=sd.set_index(keys=['col2','year']) 
In [7]: sd2.ix[(4.0,2012)] 
Out[7]: 
col1  222221 
amount  222 
Name: (4.0, 2012) 

However, if I add a row with a duplicate index, I get the same error as well:

In [8]: s = StringIO("""col1,col2,year,amount 
    ...: 111111,3.5,2012,700 
    ...: 111112,3.5,2011,600 
    ...: 222221,4.0,2012,222 
    ...: 222221,4.0,2012,223""") 

In [9]: sd=pd.read_csv(s) 
In [10]: sd2=sd.set_index(keys=['col2','year']) 
In [11]: sd2.ix[(4.0,2012)] 
--------------------------------------------------------------------------- 
IndexError        Traceback (most recent call last) 
<ipython-input-7-1b787a1d99df> in <module>() 
----> 1 sd2.ix[(4.0,2012)] 

C:\Python27\lib\site-packages\pandas\core\indexing.pyc in __getitem__(self, key) 
    32     pass 
    33 
---> 34    return self._getitem_tuple(key) 
    35   else: 
    36    return self._getitem_axis(key, axis=0) 

... 

IndexError: index out of bounds 

Is it possible that you have duplicate values in ('col2', 'year')?
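One way to check for such duplicates is sketched below. It assumes a recent pandas release rather than the 0.8.1/0.10.x versions discussed in this thread, and the names `csv_text`, `dup_mask`, `dup_rows`, and `index_is_unique` are made up for illustration.

```python
import io

import pandas as pd

# Same sample data as in the thread, including the duplicated (col2, year) pair.
csv_text = """col1,col2,year,amount
111111,3.5,2012,700
111112,3.5,2011,600
222221,4.0,2012,222
222221,4.0,2012,223
"""

sd = pd.read_csv(io.StringIO(csv_text))

# keep=False marks every row whose (col2, year) pair occurs more than once.
dup_mask = sd.duplicated(subset=["col2", "year"], keep=False)
dup_rows = sd[dup_mask]
print(dup_rows)

# The same question can be asked of the index after set_index.
index_is_unique = sd.set_index(["col2", "year"]).index.is_unique
print(index_is_unique)
```

If `dup_rows` is non-empty, the would-be index has repeated keys.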

I don't know whether this is a bug or a constraint on MultiIndex (but in that case the error message could be clearer, I think). But you can remove the duplicate values before setting the index, as follows:

In [21]: sd=pd.read_csv(s)

In [22]: sd = sd.drop_duplicates(['col2', 'year'])

In [23]: sd2=sd.set_index(keys=['col2','year'])

In [24]: sd2.ix[(4.0,2012)]
Out[24]:
col1  222221
amount  222
Name: (4.0, 2012)

For more information, see: http://pandas.pydata.org/pandas-docs/stable/indexing.html#duplicate-data and http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.drop_duplicates.html
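For readers on a newer pandas, where `.ix` has been removed, the same drop-then-index approach can be sketched with `.loc`. This is only an illustration under that assumption; the names `csv_text` and `deduped` are hypothetical.

```python
import io

import pandas as pd

csv_text = """col1,col2,year,amount
111111,3.5,2012,700
111112,3.5,2011,600
222221,4.0,2012,222
222221,4.0,2012,223
"""

sd = pd.read_csv(io.StringIO(csv_text))

# By default drop_duplicates keeps the first row of each (col2, year) group,
# so the row with amount 223 is discarded here.
deduped = sd.drop_duplicates(subset=["col2", "year"])

sd2 = deduped.set_index(["col2", "year"])

# With a unique MultiIndex, a full key selects a single row as a Series.
row = sd2.loc[(4.0, 2012)]
print(row)
```

Note that dropping duplicates throws data away, so it only fits cases where the extra rows really are redundant.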

+0

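Yes, that was the problem; thank you very much for your insight. I intended to use a MultiIndex as a more efficient way to select DataFrame rows based on multiple columns (see http://stackoverflow.com/questions/14737566/pandas-performance-issue-need-help-to-optimize/14750813#14750813), but since the index must be unique, I can't use this approach. – femibyte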
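If the duplicate rows need to be kept, two possible alternatives are sketched below. Both assume a recent pandas release, not the 0.8.1 behaviour described above, and the names `csv_text`, `by_mask`, and `by_index` are made up for illustration. The first filters with a boolean mask and needs no index at all. The second relies on the assumption that newer versions let `.loc` return every row matching a key on a non-unique MultiIndex.

```python
import io

import pandas as pd

csv_text = """col1,col2,year,amount
111111,3.5,2012,700
111112,3.5,2011,600
222221,4.0,2012,222
222221,4.0,2012,223
"""

sd = pd.read_csv(io.StringIO(csv_text))

# Option 1: boolean mask over the two columns; duplicates are no obstacle.
mask = (sd["col2"] == 4.0) & (sd["year"] == 2012)
by_mask = sd[mask]
print(by_mask)

# Option 2: keep the duplicates in the index and sort it first, so that
# lookups do not have to scan an unsorted index.
sd2 = sd.set_index(["col2", "year"]).sort_index()
by_index = sd2.loc[(4.0, 2012)]  # a DataFrame holding all matching rows
print(by_index)
```

Which of the two is faster will depend on the data size and on how many lookups are made, so it is worth timing both on the real data.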