Using a MultiIndex on a DataFrame

This is a follow-up to the answer to this question:

pandas performance issue - need help to optimize

The following suggestion, to use a MultiIndex, works:

df = DataFrame(np.arange(20).reshape(5,4))
df2 = df.set_index(keys=[0,1,2])
df2.ix[(4,5,6)]

So I created a file, sample_data.csv, that looks like this:

col1,col2,year,amount 
111111,3.5,2012,700 
111112,3.5,2011,600 
222221,4.0,2012,222 
...

I then ran the following:

import numpy as np 
import pandas as pd 
sd=pd.read_csv('sample_data.csv') 
sd2=sd.set_index(keys=['col2','year']) 
sd2.ix[(4.0,2012)]

However, this produces the following error: IndexError: index out of bounds

Any idea why it works in the former case but not in the latter? This is what the error looks like:

IndexError        Traceback (most recent call last) 
<ipython-input-19-1d72a961db95> in <module>() 
----> 1 sd2.ix[(4.0,2012)] 

/Library/Python/2.7/site-packages/pandas-0.8.1-py2.7-macosx-10.7-intel.egg/pandas/core/indexing.pyc in __getitem__(self, key) 
    31     pass 
    32 
---> 33    return self._getitem_tuple(key) 
    34   else: 
    35    return self._getitem_axis(key, axis=0)

Source

2013-02-07 femibyte

Your code works for me. Which version of pandas are you using? – joris

It also works for me (on pd 10.0). You can also skip the set_index step if you use: pd.read_csv('sample_data.csv', index_col=['col2', 'year']) –

pandas-0.8.1. Is that why it is failing? – femibyte
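The index_col suggestion from the comment above can be sketched as follows. This is a minimal illustration and was not part of the original thread: it assumes Python 3 and a recent pandas release, and it reads the sample rows from an in-memory string (the csv_text variable is a stand-in for sample_data.csv).

```python
from io import StringIO

import pandas as pd

# Stand-in for sample_data.csv (same rows as in the question).
csv_text = """col1,col2,year,amount
111111,3.5,2012,700
111112,3.5,2011,600
222221,4.0,2012,222"""

# Build the (col2, year) MultiIndex while reading,
# instead of calling set_index afterwards.
sd2 = pd.read_csv(StringIO(csv_text), index_col=["col2", "year"])

print(sd2.index.names)
print(sd2.columns.tolist())
```

The remaining columns (col1 and amount) stay as ordinary data columns, just as they do after set_index.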

To show that it works for me (pandas 0.10.1):

In [1]: from StringIO import StringIO 
In [2]: import numpy as np 
In [3]: import pandas as pd 
In [4]: s = StringIO("""col1,col2,year,amount 
    ...: 111111,3.5,2012,700 
    ...: 111112,3.5,2011,600 
    ...: 222221,4.0,2012,222""") 

In [5]: sd=pd.read_csv(s) 
In [6]: sd2=sd.set_index(keys=['col2','year']) 
In [7]: sd2.ix[(4.0,2012)] 
Out[7]: 
col1  222221 
amount  222 
Name: (4.0, 2012)

However, if I add a row with a duplicate index, I get the same error as well:

In [8]: s = StringIO("""col1,col2,year,amount 
    ...: 111111,3.5,2012,700 
    ...: 111112,3.5,2011,600 
    ...: 222221,4.0,2012,222 
    ...: 222221,4.0,2012,223""") 

In [9]: sd=pd.read_csv(s) 
In [10]: sd2=sd.set_index(keys=['col2','year']) 
In [11]: sd2.ix[(4.0,2012)] 
--------------------------------------------------------------------------- 
IndexError        Traceback (most recent call last) 
<ipython-input-7-1b787a1d99df> in <module>() 
----> 1 sd2.ix[(4.0,2012)] 

C:\Python27\lib\site-packages\pandas\core\indexing.pyc in __getitem__(self, key) 
    32     pass 
    33 
---> 34    return self._getitem_tuple(key) 
    35   else: 
    36    return self._getitem_axis(key, axis=0) 

... 

IndexError: index out of bounds

Is it possible that you have duplicate values in ('col2', 'year')?
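One way to check for such duplicates before building the index is sketched below. It is an illustration rather than part of the original answer, and it assumes Python 3 and a recent pandas release; the csv_text variable stands in for the file contents.

```python
from io import StringIO

import pandas as pd

# Sample rows including the duplicated (col2, year) pair from the example above.
csv_text = """col1,col2,year,amount
111111,3.5,2012,700
111112,3.5,2011,600
222221,4.0,2012,222
222221,4.0,2012,223"""

sd = pd.read_csv(StringIO(csv_text))

# keep=False flags every row whose (col2, year) pair occurs more than once.
dupes = sd[sd.duplicated(["col2", "year"], keep=False)]
print(dupes)

# After set_index, the same condition shows up as a non-unique index.
is_unique = sd.set_index(["col2", "year"]).index.is_unique
print(is_unique)
```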

I don't know whether this is a bug or a constraint of MultiIndex (though in that case the error message could be clearer, I think). But you can remove the duplicate values before setting the index, as follows:

In [21]: sd=pd.read_csv(s)

In [22]: sd = sd.drop_duplicates(['col2', 'year'])

In [23]: sd2=sd.set_index(keys=['col2','year'])

In [24]: sd2.ix[(4.0,2012)]
Out[24]:
col1  222221
amount  222
Name: (4.0, 2012)

For more information, see http://pandas.pydata.org/pandas-docs/stable/indexing.html#duplicate-data and http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.drop_duplicates.html.

Source

2013-02-08 09:06:34 joris

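Yes, that is the problem; thank you very much for the insight. I intended to use a MultiIndex as a more efficient way to select DataFrame rows based on multiple columns (see http://stackoverflow.com/questions/14737566/pandas-performance-issue-need-help-to-optimize/14750813#14750813), but since the index has to be unique, I can't use this approach. – femibyte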
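A note for readers on newer pandas versions: the .ix indexer used throughout this thread was deprecated in pandas 0.20 and removed in 1.0, with .loc as the label-based replacement. The sketch below was not part of the original thread and assumes Python 3 and a recent pandas release. In such versions, as far as I can tell, a .loc lookup on a non-unique MultiIndex returns every matching row instead of raising an IndexError, so the uniqueness restriction seen here on 0.8.1 and 0.10.1 may no longer apply.

```python
from io import StringIO

import pandas as pd

# Same sample rows as above, including the duplicated (4.0, 2012) key.
csv_text = """col1,col2,year,amount
111111,3.5,2012,700
111112,3.5,2011,600
222221,4.0,2012,222
222221,4.0,2012,223"""

sd = pd.read_csv(StringIO(csv_text))

# Sorting the index is optional but keeps lookups on a MultiIndex fast.
sd2 = sd.set_index(["col2", "year"]).sort_index()

# The trailing ":" makes it explicit that the tuple is a row key.
rows = sd2.loc[(4.0, 2012), :]
print(rows)
```

If only one row per key is wanted, dropping duplicates first, as in the answer above, is still the simpler route.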
