2016-05-02 102 views

Combining dataframes with different indexes in pandas

I have generated a dataframe of probabilities from a scikit-learn classifier like this:

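import pandas as pd

def preprocess_category_series(series, key):
    # Non-categorical columns pass through unchanged.
    if series.dtype != 'category':
        return series
    if series.cat.ordered:
        # Ordered categories: use the integer codes; missing values
        # (code -1) are replaced with the most common code.
        s = pd.Series(series.cat.codes, name=key)
        mode = s.mode()[0]
        s[s < 0] = mode
        return s
    else:
        # Unordered categories: one-hot encode, dropping the first level.
        return pd.get_dummies(series, drop_first=True, prefix=key)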

data = df[df.year == 2012] 
factors = pd.concat([preprocess_category_series(data[k], k) for k in factor_keys], axis=1) 
predictions = pd.DataFrame([dict(zip(clf.classes_, l)) for l in clf.predict_proba(factors)]) 

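I now want to append these probabilities back to my original dataframe. However, the predictions dataframe generated above, while it preserves the order of the items in data, has lost data's index. I assumed I would be able to do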

pd.concat([data, predictions], axis=1, ignore_index=True) 

but this produces the error:

InvalidIndexError: Reindexing only valid with uniquely valued Index objects 

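I have seen that this sometimes occurs if column names are duplicated, but in this case none are. What is that error? What is the best way to put these dataframes together?

data.head()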

                year  serial  hwtfinl                       region statefip  \
cpsid
20121000000100  2012       1  3796.85  East South Central Division  Alabama
20121000000100  2012       1  3796.85  East South Central Division  Alabama
20121000000100  2012       1  3796.85  East South Central Division  Alabama
20120800000500  2012       6  2814.24  East South Central Division  Alabama
20120800000600  2012       7  2828.42  East South Central Division  Alabama

                county  month  pernum          cpsidp     wtsupp  ...  \
cpsid                                                             ...
20121000000100       0     11       1  20121000000101  3208.1213  ...
20121000000100       0     11       2  20121000000102  3796.8506  ...
20121000000100       0     11       3  20121000000103  3386.4305  ...
20120800000500       0     11       1  20120800000501  2814.2417  ...
20120800000600    1097     11       1  20120800000601  2828.4193  ...

                 race        hispan  educ           votereg  \
cpsid
20121000000100  White  Not Hispanic   111             Voted
20121000000100  White  Not Hispanic   111  Did not register
20121000000100  White  Not Hispanic   111             Voted
20120800000500  White  Not Hispanic    92             Voted
20120800000600  White  Not Hispanic    73  Did not register

                                         educ_parsed      age4         educ4  \
cpsid
20121000000100                     Bachelor's degree       65+  College grad
20121000000100                     Bachelor's degree       65+  College grad
20121000000100                     Bachelor's degree  Under 30  College grad
20120800000500  Associate's degree, academic program     45-64  College grad
20120800000600     High school diploma or equivalent       65+    HS or less

                race4 region4  gender
cpsid
20121000000100  White   South    Male
20121000000100  White   South  Female
20121000000100  White   South  Female
20120800000500  White   South  Female
20120800000600  White   South  Female

predictions.head()

          a         b         c         d         e         f
0  0.119534  0.336761  0.188023  0.136651  0.095342  0.123689
1  0.148409  0.346429  0.134852  0.169661  0.087556  0.113093
2  0.389586  0.195802  0.101738  0.085705  0.114612  0.112557
3  0.277783  0.262079  0.180037  0.102030  0.071171  0.106900
4  0.158404  0.396487  0.088064  0.079058  0.171540  0.106447

Just for fun, I tried this specifically with only the head rows:

pd.concat([data_2012.iloc[0:5], predictions.iloc[0:5]], axis=1, ignore_index=True) 

The same error appears.

This works perfectly for me. What is your pandas version? – Ali

I'm on version 0.18.0 – futuraprime

Could you please print predictions.head() and data.head()? – Shovalt

Answers

0

I'm on 0.18.0 too. Here is what I tried, and it works. Is this what you are doing?

import numpy as np 
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]) 
Y = np.array([1, 1, 1, 2, 2, 2]) 
from sklearn.naive_bayes import GaussianNB 
clf = GaussianNB() 
clf.fit(X,Y) 
import pandas as pd 
data = pd.DataFrame(X) 
data['y']=Y 
predictions = pd.DataFrame([dict(zip(clf.classes_, l)) for l in clf.predict_proba(X)]) 
pd.concat([data, predictions], axis=1, ignore_index=True) 
   0  1  2             3             4
0 -1 -1  1  1.000000e+00  1.522998e-08
1 -2 -1  1  1.000000e+00  3.775135e-11
2 -3 -2  1  1.000000e+00  5.749523e-19
3  1  1  2  1.522998e-08  1.000000e+00
4  2  1  2  3.775135e-11  1.000000e+00
5  3  2  2  5.749523e-19  1.000000e+00
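
One difference between this toy example and the question's data may matter: here both frames carry the default 0..n-1 index, whereas the question's data is indexed by cpsid, and that index repeats (20121000000100 appears three times in the printed head). The following is a minimal sketch of that situation, not the asker's actual data: the values, the shortened index labels, and the column names `year`, `a`, and `b` are made up for illustration.

```python
import pandas as pd

# Stand-in for `data`: the row index contains duplicates, as cpsid does.
data = pd.DataFrame(
    {"year": [2012] * 5},
    index=pd.Index([100, 100, 100, 500, 600], name="cpsid"),
)

# Stand-in for `predictions`: same row order, but a fresh 0..4 index.
predictions = pd.DataFrame(
    {"a": [0.1, 0.2, 0.3, 0.4, 0.5], "b": [0.9, 0.8, 0.7, 0.6, 0.5]}
)

print(data.index.is_unique)  # False

# With axis=1, rows are still aligned on the index; ignore_index=True only
# replaces the column labels. Aligning a duplicated index against a
# different index is expected to fail.
try:
    pd.concat([data, predictions], axis=1, ignore_index=True)
    error = None
except Exception as exc:  # the exact exception type may vary by pandas version
    error = exc

print(type(error).__name__ if error is not None else "no error")
```

If this sketch matches the real data, the error would come from the duplicated row labels rather than from duplicated column names.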

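This is almost identical to what I am doing; the only notable difference is that the classifier was generated from a different dataset. – futuraprime

That should not make any difference. You can train your classifier with whatever data you want. Can you add more code? – Ali

Added more code; I think that is basically the whole thing. – futuraprime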

0

It turns out there is a relatively simple solution:

predictions.index = data.index 
pd.concat([data, predictions], axis=1) 

Now it works perfectly. Not sure why it would not work the way I originally tried it.
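
A plausible reading, offered as a sketch rather than a definitive diagnosis: once both frames share an identical index, concat has nothing to realign, so the repeated cpsid values no longer get in the way. The same result can be reached by attaching the index when predictions is built. In the example below, `proba` and `classes` are made-up stand-ins for clf.predict_proba(factors) and clf.classes_, and the small `data` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in for `data`, with a duplicated row index like cpsid.
data = pd.DataFrame(
    {"year": [2012, 2012, 2012]},
    index=pd.Index([100, 100, 500], name="cpsid"),
)

# Made-up stand-ins for clf.predict_proba(factors) and clf.classes_.
proba = np.array([[0.2, 0.8], [0.6, 0.4], [0.5, 0.5]])
classes = ["a", "b"]

# Attach data's index while building the frame; rows are matched by position.
predictions = pd.DataFrame(proba, columns=classes, index=data.index)

# Identical indexes on both sides, so no realignment is needed.
combined = pd.concat([data, predictions], axis=1)
print(combined)
```

If the cpsid index is not needed afterwards, pd.concat([data.reset_index(drop=True), predictions], axis=1) is another option, at the cost of discarding the original index.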