2013-02-01 106 views
6

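Boolean mask in pandas Panel

I'm having some trouble masking a Panel the same way I would a DataFrame. What I want to do seems simple, but I haven't found a way in the docs or on online forums. A simple example is below: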

import pandas 
import numpy as np 
import datetime 
start_date = datetime.datetime(2009,3,1,6,29,59) 
r = pandas.date_range(start_date, periods=12) 
cols_1 = ['AAPL', 'AAPL', 'GOOG', 'GOOG', 'GS', 'GS'] 
cols_2 = ['close', 'rate', 'close', 'rate', 'close', 'rate'] 
dat = np.random.randn(12, 6) 

dftst = pandas.DataFrame(dat, columns=pandas.MultiIndex.from_arrays([cols_1, cols_2], names=['ticker','field']), index=r) 
pn = dftst.T.to_panel().transpose(2,0,1) 
print(pn)

Out[14]: 
<class 'pandas.core.panel.Panel'> 
Dimensions: 2 (items) x 12 (major_axis) x 3 (minor_axis) 
Items axis: close to rate 
Major_axis axis: 2009-03-01 06:29:59 to 2009-03-12 06:29:59 
Minor_axis axis: AAPL to GS 

I now have a Panel object, and if I take a slice along the items axis I get a DataFrame:

close_p = pn['close'] 
print(close_p)

Out[16]: 
ticker     AAPL  GOOG  GS 
2009-03-01 06:29:59 -0.082203 -0.286354 1.227193 
2009-03-02 06:29:59 0.340005 -0.688933 -1.505137 
2009-03-03 06:29:59 -0.525567 0.321858 -0.035047 
2009-03-04 06:29:59 -0.123549 -0.841781 -0.616523 
2009-03-05 06:29:59 -0.407504 0.188372 1.311262 
2009-03-06 06:29:59 0.272883 0.817179 0.584664 
2009-03-07 06:29:59 -1.767227 1.168876 0.443096 
2009-03-08 06:29:59 -0.685501 -0.534373 -0.063906 
2009-03-09 06:29:59 0.851820 0.068740 0.566537 
2009-03-10 06:29:59 0.390678 -0.012422 -0.152375 
2009-03-11 06:29:59 -0.985585 -0.917705 -0.585091 
2009-03-12 06:29:59 0.067498 -0.764343 0.497270 

I can filter this data in two ways:

1) Create a mask and mask the data as follows:

msk = close_p > 0 
close_p = close_p.mask(msk) 

2) Slice directly with the boolean operation that produced msk above:

close_p = close_p[close_p > 0] 
Out[28]: 
ticker     AAPL  GOOG  GS 
2009-03-01 06:29:59  NaN  NaN 1.227193 
2009-03-02 06:29:59 0.340005  NaN  NaN 
2009-03-03 06:29:59  NaN 0.321858  NaN 
2009-03-04 06:29:59  NaN  NaN  NaN 
2009-03-05 06:29:59  NaN 0.188372 1.311262 
2009-03-06 06:29:59 0.272883 0.817179 0.584664 
2009-03-07 06:29:59  NaN 1.168876 0.443096 
2009-03-08 06:29:59  NaN  NaN  NaN 
2009-03-09 06:29:59 0.851820 0.068740 0.566537 
2009-03-10 06:29:59 0.390678  NaN  NaN 
2009-03-11 06:29:59  NaN  NaN  NaN 
2009-03-12 06:29:59 0.067498  NaN 0.497270 
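
A side note on the two snippets above: they do not give the same result. DataFrame.mask(cond) replaces values with NaN where the condition is True, while boolean indexing keeps those values, which is what DataFrame.where(cond) does. A minimal sketch, assuming a small made-up frame in place of close_p (the numbers and the variable names other than close_p and msk are illustrative only):

```python
import pandas as pd

# Small hypothetical frame standing in for close_p.
close_p = pd.DataFrame({"AAPL": [-0.08, 0.34], "GS": [1.23, -1.51]})
msk = close_p > 0

kept_by_index = close_p[msk]        # keeps values where msk is True
kept_by_where = close_p.where(msk)  # same result as boolean indexing
hidden_by_mask = close_p.mask(msk)  # the inverse: NaN where msk is True

print(kept_by_index.equals(kept_by_where))
print(hidden_by_mask)
```

So to reproduce the output of 2) with the method from 1), where(msk) or mask(~msk) would be the call to use.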

What I can't figure out is how to filter all of my data based on one mask without a for loop. I can do the following:

msk = (pn['rate'] > 0) & (pn['close'] > 0) 
def mask_panel(pan, msk): 
    # replace values with NaN in every item wherever msk is True (modifies pan in place)
    for item in pan.items: 
        pan[item] = pan[item].mask(msk) 
    return pan 
print(pn['close'])

Out[32]: 
ticker     AAPL  GOOG  GS 
2009-03-01 06:29:59 -0.082203 -0.286354 1.227193 
2009-03-02 06:29:59 0.340005 -0.688933 -1.505137 
2009-03-03 06:29:59 -0.525567 0.321858 -0.035047 
2009-03-04 06:29:59 -0.123549 -0.841781 -0.616523 
2009-03-05 06:29:59 -0.407504 0.188372 1.311262 
2009-03-06 06:29:59 0.272883 0.817179 0.584664 
2009-03-07 06:29:59 -1.767227 1.168876 0.443096 
2009-03-08 06:29:59 -0.685501 -0.534373 -0.063906 
2009-03-09 06:29:59 0.851820 0.068740 0.566537 
2009-03-10 06:29:59 0.390678 -0.012422 -0.152375 
2009-03-11 06:29:59 -0.985585 -0.917705 -0.585091 
2009-03-12 06:29:59 0.067498 -0.764343 0.497270 

mask_panel(pn, msk) 

print(pn['close'])

Out[34]: 
ticker     AAPL  GOOG  GS 
2009-03-01 06:29:59 -0.082203 -0.286354  NaN 
2009-03-02 06:29:59  NaN -0.688933 -1.505137 
2009-03-03 06:29:59 -0.525567  NaN -0.035047 
2009-03-04 06:29:59 -0.123549 -0.841781 -0.616523 
2009-03-05 06:29:59 -0.407504  NaN  NaN 
2009-03-06 06:29:59  NaN  NaN  NaN 
2009-03-07 06:29:59 -1.767227  NaN  NaN 
2009-03-08 06:29:59 -0.685501 -0.534373 -0.063906 
2009-03-09 06:29:59  NaN  NaN  NaN 
2009-03-10 06:29:59  NaN -0.012422 -0.152375 
2009-03-11 06:29:59 -0.985585 -0.917705 -0.585091 
2009-03-12 06:29:59  NaN -0.764343  NaN 

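So the loop above does the trick. I know there are faster vectorized ways using ndarray, but I haven't put them together yet. It also seems like this should be functionality built into the pandas library. If there is a way to do this that I've missed, any suggestions would be much appreciated.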

+0

This feels like something you ought to be able to do using a boolean Panel, 'pn.gt(0)'... –

+0

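Thanks Andy, but unless I'm mistaken I think that would do something different. It would mask each DataFrame in my panel wherever that DataFrame's own values are less than 0. The operation I want is to mask every DataFrame in the panel wherever 'close' is less than 0. Again, close is one specific DataFrame in my panel. I'll keep fiddling and post if I come up with something better. – granders19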

+0

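Should this only affect the close DataFrame (one part of the panel)? Do you want to change it within the panel and leave the others unchanged? –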

Answers

9

I think this will work (it's what Panel.where should do, but that is a bit non-trivial because it has to handle a bunch of cases)

# construct the mask in 2-d (a frame) 
In [36]: mask = (pn['close']>0) & (pn['rate']>0) 

In [37]: mask 
Out[37]: 
ticker    AAPL GOOG  GS 
2009-03-01 06:29:59 False False False 
2009-03-02 06:29:59 False False True 
.... 

# here's the key, this broadcasts, setting the values which 
# don't meet the condition to nan 
In [38]: masked_values = np.where(mask,pn.values,np.nan) 

# reconstruct the panel (_construct_axes_dict is an internal function that returns a
# dict of the axes, e.g. items -> the items, major_axis -> ...)
In [42]: x = pd.Panel(masked_values,**pn._construct_axes_dict()) 

In [43]: x 
Out[43]: 
<class 'pandas.core.panel.Panel'> 
Dimensions: 2 (items) x 12 (major_axis) x 3 (minor_axis) 
Items axis: close to rate 
Major_axis axis: 2009-03-01 06:29:59 to 2009-03-12 06:29:59 
Minor_axis axis: AAPL to GS 

# the values 
In [44]: x.values 
Out[44]: 
array([[[  nan,   nan,   nan], 
    [  nan,   nan, 0.09575723], 
    [  nan,   nan,   nan], 
    [  nan,   nan,   nan], 
    [  nan, 2.07229823, 0.04347515], 
    [  nan,   nan,   nan], 
    [  nan,   nan, 2.18342239], 
    [  nan,   nan, 1.73674381], 
    [  nan, 2.01173087,   nan], 
    [ 0.24109645, 0.94583072,   nan], 
    [ 0.36953467,   nan, 0.18044432], 
    [ 1.74164222, 1.02314752, 1.73736033]], 

    [[  nan,   nan,   nan], 
    [  nan,   nan, 0.06960387], 
    [  nan,   nan,   nan], 
    [  nan,   nan,   nan], 
    [  nan, 0.63202199, 0.56724391], 
    [  nan,   nan,   nan], 
    [  nan,   nan, 0.71964824], 
    [  nan,   nan, 1.03482927], 
    [  nan, 0.18256148,   nan], 
    [ 1.29451667, 0.49804327,   nan], 
    [ 2.04726538,   nan, 0.12883128], 
    [ 0.70647885, 0.7277734 , 0.77844475]]]) 
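
The broadcasting step in the np.where call above can be checked without a Panel (which later pandas releases removed). A minimal sketch, assuming a random (2, 12, 3) ndarray as a stand-in for pn.values; the names values, close, rate and looped are illustrative and not part of the answer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for pn.values: 2 items x 12 dates x 3 tickers.
values = rng.standard_normal((2, 12, 3))
close, rate = values[0], values[1]

# 2-D mask with shape (12, 3), built from the two items.
mask = (close > 0) & (rate > 0)

# NumPy aligns trailing dimensions, so the (12, 3) mask is reused
# for every item along the leading axis of the (2, 12, 3) array.
masked_values = np.where(mask, values, np.nan)

# Reference result from an explicit per-item loop.
looped = np.stack([np.where(mask, item, np.nan) for item in values])

print(masked_values.shape)                                # (2, 12, 3)
print(np.allclose(masked_values, looped, equal_nan=True))  # True
```

Note that np.where keeps the values where the mask is True, so this matches DataFrame.where rather than the DataFrame.mask call used in the question's loop; passing ~mask instead would reproduce the loop's output.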
+0

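Thanks Jeff, that's awesome! This is a much better solution than the loop I came up with. I agree it would be nice if this were built into the Panel's .where method. – granders19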

+0

No problem - will get to it at some point https://github.com/pydata/pandas/issues/2790 – Jeff