2013-02-19 31 views
5

pandas: qcut result assigned as a new column, as seen in this notebook

http://nbviewer.ipython.org/urls/raw.github.com/carljv/Will_it_Python/master/ARM/ch5/arsenic_wells_switching.ipynb

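In the notebook linked above, I see the result of qcut being assigned as a new column of a DataFrame. The DataFrame has two columns, yet assigning the qcut output to a new column somehow finds the correct level (bin) for the "var" variable; the other variable is not consulted. What are the pandas semantics here? Sample output follows: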

In [2]: 
from pandas import * 
from statsmodels.formula.api import logit 
from statsmodels.nonparametric import KDE 
from patsy import dmatrix, dmatrices 

In [3]: 
df = read_csv('wells.dat', sep = ' ', header = 0, index_col = 0) 
print df.head() 
   switch  arsenic       dist  assoc  educ
1       1     2.36  16.826000      0     0
2       1     0.71  47.321999      0     0
3       0     2.07  20.966999      0    10
4       1     1.15  21.486000      0    12
5       1     1.10  40.874001      1    14


In [4]: 
model_form = ('switch ~ center(I(dist/100.)) + center(arsenic) + ' + 
       'center(I(educ/4.)) + ' + 
       'center(I(dist/100.)) : center(arsenic) + ' + 
       'center(I(dist/100.)) : center(I(educ/4.)) + ' + 
       'center(arsenic) : center(I(educ/4.))' 
      ) 
model4 = logit(model_form, df = df).fit() 

In [20]: 
resid_df = DataFrame({'var': df['arsenic'], 'resid': model4.resid}) 
resid_df[:10] 
Out [20]: 
        resid   var
1    0.842596  2.36
2    1.281417  0.71
3   -1.613751  2.07
4    0.996195  1.15
5    1.005102  1.10
6    0.592056  3.90
7    0.941372  2.97
8    0.640139  3.24
9    0.886626  3.28
10   1.130149  2.52

In [15]: 
qcut(df['arsenic'], 40) 
Out [15]: 
Categorical: arsenic 
array([(2.327, 2.47], (0.68, 0.71], (1.953, 2.07], ..., [0.51, 0.53], 
     (0.62, 0.64], (0.64, 0.68]], dtype=object) 
Levels (40): Index([[0.51, 0.53], (0.53, 0.56], (0.56, 0.59], 
        (0.59, 0.62], (0.62, 0.64], (0.64, 0.68], 
        (0.68, 0.71], (0.71, 0.75], (0.75, 0.78], 
        (0.78, 0.82], (0.82, 0.86], (0.86, 0.9], (0.9, 0.95], 
        (0.95, 1.0065], (1.0065, 1.0513], (1.0513, 1.1], 
        (1.1, 1.15], (1.15, 1.2], (1.2, 1.25], (1.25, 1.3], 
        (1.3, 1.36], (1.36, 1.42], (1.42, 1.49], 
        (1.49, 1.57], (1.57, 1.66], (1.66, 1.76], 
        (1.76, 1.858], (1.858, 1.953], (1.953, 2.07], 
        (2.07, 2.2], (2.2, 2.327], (2.327, 2.47], 
        (2.47, 2.61], (2.61, 2.81], (2.81, 2.98], 
        (2.98, 3.21], (3.21, 3.42], (3.42, 3.791], 
        (3.791, 4.475], (4.475, 9.65]], dtype=object) 

In [17]: 
resid_df['bins'] = qcut(df['arsenic'], 40) 
resid_df[:20] 
Out [17]: 
       resid   var            bins
1   0.842596  2.36   (2.327, 2.47]
2   1.281417  0.71    (0.68, 0.71]
3  -1.613751  2.07   (1.953, 2.07]
4   0.996195  1.15     (1.1, 1.15]
5   1.005102  1.10   (1.0513, 1.1]
6   0.592056  3.90  (3.791, 4.475]
7   0.941372  2.97    (2.81, 2.98]
8   0.640139  3.24    (3.21, 3.42]

The correct bin for "var" was found; the assignment paid no attention to "resid".

Answers

1

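I think I get it... The Categorical object that qcut returns has a "labels" attribute; for each point, it carries a number (e.g. 1, 2, 3) identifying the quantile bin that the point falls into. Then, if the qcut result is assigned to a new column of a DataFrame, pandas matches these "labels" against the DataFrame's "index".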
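
To make the mechanics concrete, here is a minimal sketch. It assumes a recent pandas release, in which qcut on a Series returns a categorical Series and the integer bin numbers are exposed as .cat.codes (older releases, like the one in the notebook, returned a Categorical with a "labels" attribute). The eight rows are copied from the question's output, and the 4-bin split is chosen only for illustration. It suggests that each row's bin is computed from that row's own arsenic value, which is the same data stored in "var", so "resid" never enters into it.

```python
import pandas as pd

# First eight arsenic values from the question's output, indexed 1..8.
df = pd.DataFrame(
    {"arsenic": [2.36, 0.71, 2.07, 1.15, 1.10, 3.90, 2.97, 3.24]},
    index=range(1, 9),
)

# Same layout as resid_df in the question; "var" is just df["arsenic"].
resid_df = pd.DataFrame({"var": df["arsenic"]})
resid_df["resid"] = [0.84, 1.28, -1.61, 1.00, 1.01, 0.59, 0.94, 0.64]

# Bin the arsenic values into 4 quantile bins (illustrative; the notebook uses 40).
bins = pd.qcut(df["arsenic"], 4)

# One integer bin number per row, counted from 0.
codes = bins.cat.codes

# The result carries the same index as df, so the assignment lines up row by row.
resid_df["bins"] = bins
print(resid_df)
```

Because the bins are derived from the very values that sit in "var", every row necessarily lands in an interval containing its own "var" value.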

+0

Can you provide a code example? Doing what you describe doesn't seem to work. – feetwet 2016-12-23 21:44:19
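
As a rough illustration of how column assignment behaves, here is a sketch with made-up data (the names s, frame and binned are hypothetical). It assumes a current pandas release, not the 2013 version used in the notebook. It suggests that the bin codes themselves are not matched against the index: a Series is aligned on its index labels, while a bare Categorical has no index and is assigned in row order.

```python
import pandas as pd

# Made-up values with string index labels.
s = pd.Series([10.0, 20.0, 30.0, 40.0], index=["a", "b", "c", "d"])
frame = pd.DataFrame({"var": s})

# Reverse the row order before binning; index labels travel with the values.
binned = pd.qcut(s.iloc[::-1], 2, labels=["low", "high"])

# Assigning a Series: pandas aligns on index labels, so the order is restored.
frame["by_index"] = binned

# Assigning the underlying Categorical (no index): values go in by position.
frame["by_position"] = binned.values

print(frame)
```

In the notebook, the values passed to qcut and the rows of resid_df are in the same order and share the same index, so either route would give the same column.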