2015-02-05 114 views
2

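Rpy2 & pandas: join output from predict to a pandas dataframe

I am using the randomForest library in R via RPy2. I would like to pass back the values calculated with the caret predict method and join them to the original pandas dataframe. See the example below.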

import pandas as pd 
import numpy as np 
import rpy2.robjects as robjects 
from rpy2.robjects import pandas2ri 
pandas2ri.activate() 
r = robjects.r 
r.library("randomForest") 
r.library("caret") 

df = pd.DataFrame(data=np.random.rand(100, 10), columns=["a{}".format(i) for i in range(10)]) 
df["b"] = ['a' if x < 0.5 else 'b' for x in np.random.sample(size=100)] 
train = df.loc[df.a0 < .75]
withheld = df.loc[df.a0 >= .75]

rf = r.randomForest(robjects.Formula('b ~ .'), data=train) 
pr = r.predict(rf, withheld) 
print(pr.rx())

which returns

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 
a a b b b a a a a b a a a a a b a a a a 
Levels: a b 

But how can I join this to the withheld dataframe, or compare it to the original values?

I have tried this:

import pandas.rpy.common as com 
com.convert_robj(pr) 

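But this returns a dictionary whose keys are strings. I guess there is a workaround: call withheld.reset_index(), convert the dictionary keys to integers, and then join the two. But there must be a simpler way!

Answers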

3

There is a pull request that adds R factor to Pandas Categorical functionality to Pandas. It has not yet been merged into the Pandas master branch. When it is,

import pandas.rpy.common as rcom 
rcom.convert_robj(pr) 

will convert pr to a Pandas Categorical. Until then, the following can be used as a workaround:

def convert_factor(obj): 
    """ 
    Taken from jseabold's PR: https://github.com/pydata/pandas/pull/9187 
    """ 
    ordered = r["is.ordered"](obj)[0] 
    categories = list(obj.levels) 
    codes = np.asarray(obj) - 1 # zero-based indexing 
    values = pd.Categorical.from_codes(codes, categories=categories,
                                       ordered=ordered)
    return values 
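
The key step in this helper is the shift from R's 1-based factor codes to the 0-based codes that pd.Categorical.from_codes expects. A minimal sketch that runs without R is shown here; the names r_codes and levels are made up for illustration and stand in for np.asarray(obj) and obj.levels:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for an R factor: 1-based integer codes plus its levels.
r_codes = np.array([1, 1, 2, 2, 1])
levels = ["a", "b"]

# Shift to zero-based codes, as convert_factor does, then build the Categorical.
values = pd.Categorical.from_codes(r_codes - 1, categories=levels, ordered=False)

print(list(values))  # ['a', 'a', 'b', 'b', 'a']
```

Without the subtraction, code 2 would point past the end of the two-element category list and from_codes would raise an error.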

For example,

import pandas as pd 
import numpy as np 
import rpy2.robjects as robjects 
from rpy2.robjects import pandas2ri 
pandas2ri.activate() 
r = robjects.r 
r.library("randomForest") 
r.library("caret") 

def convert_factor(obj): 
    """ 
    Taken from jseabold's PR: https://github.com/pydata/pandas/pull/9187 
    """ 
    ordered = r["is.ordered"](obj)[0] 
    categories = list(obj.levels) 
    codes = np.asarray(obj) - 1 # zero-based indexing 
    values = pd.Categorical.from_codes(codes, categories=categories,
                                       ordered=ordered)
    return values 


df = pd.DataFrame(data=np.random.rand(100, 10),
                  columns=["a{}".format(i) for i in range(10)])
df["b"] = ['a' if x < 0.5 else 'b' for x in np.random.sample(size=100)]
train = df.loc[df.a0 < .75]
# copy so that adding a column below does not write to a view of df
withheld = df.loc[df.a0 >= .75].copy()

rf = r.randomForest(robjects.Formula('b ~ .'), data=train) 
pr = convert_factor(r.predict(rf, withheld)) 

withheld['pr'] = pr 
print(withheld) 
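
The question also asks how to compare the predictions with the original values. One option, once both are columns of the same frame, is a cross-tabulation. The sketch below uses a small invented frame in place of the real withheld data, so the numbers are illustrative only:

```python
import pandas as pd

# Invented stand-in for the withheld frame: original labels and predictions.
withheld = pd.DataFrame(
    {"b": ["a", "b", "a", "b", "a"],
     "pr": pd.Categorical(["a", "b", "b", "b", "a"], categories=["a", "b"])},
    index=[3, 7, 12, 20, 41],
)

# Work with plain strings so both columns compare element by element.
pred = withheld["pr"].astype(str)

# Confusion table: rows are original values, columns are predictions.
confusion = pd.crosstab(withheld["b"], pred)
print(confusion)

# Fraction of rows where the prediction matches the original label.
accuracy = (withheld["b"] == pred).mean()
print(accuracy)
```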
1

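The R object pr returned by the function predict is a "vector", which you can think of as a Python array.array or a one-dimensional numpy array.

A "join" is not necessary, since the ordering of the elements in pr corresponds to the rows in the table withheld. One only needs to add pr as an additional column to withheld (see Adding new column to existing DataFrame in Python pandas):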

withheld['predictions'] = pd.Series(pr,
                                    index=withheld.index)
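
Passing index=withheld.index matters because withheld keeps the row labels of the rows selected from df, and these are generally not 0, 1, 2, and so on. Here is a sketch of what happens with and without it, using a made-up frame and made-up predictions:

```python
import pandas as pd

# Made-up stand-in for withheld: a subset of rows keeps its original labels.
withheld = pd.DataFrame({"a0": [0.80, 0.90, 0.95]}, index=[4, 17, 52])
pr = ["a", "b", "a"]  # made-up predictions, in the same row order

# Without an explicit index the Series is labelled 0, 1, 2, so the assignment
# aligns on labels, finds no matches, and every value becomes NaN.
withheld["misaligned"] = pd.Series(pr)

# With the frame's own index the values land on the intended rows.
withheld["predictions"] = pd.Series(pr, index=withheld.index)

print(withheld)
```

Assigning a plain list or numpy array of the right length is positional and would also work; the pitfall is specific to a Series carrying a different index.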

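By default this will add a column of integers (because R factors are encoded as integers). One can customize the conversion in rpy2 rather simply (see http://rpy.sourceforge.net/rpy2/doc-2.5/html/robjects_convert.html):

Note: version 2.6.0 of rpy2 will include handling of pandas Categorical vectors, making the converter customization described below unnecessary.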

from rpy2.robjects import numpy2ri

@robjects.conversion.ri2py.register(robjects.rinterface.SexpVector)
def ri2py_vector(vector):
    # based on
    # https://bitbucket.org/rpy2/rpy2/src/a75413b09852991869332da615fa754923c32039/rpy/robjects/pandas2ri.py?at=default#cl-73

    # special case for factors: R codes are 1-based, pandas codes are 0-based
    if 'factor' in vector.rclass:
        res = pd.Categorical.from_codes(np.asarray(vector) - 1,
                                        categories=vector.do_slot('levels'),
                                        ordered='ordered' in vector.rclass)
    else:
        # use the numpy converter first
        res = numpy2ri.ri2py(vector)
    if isinstance(res, np.recarray):
        res = pd.DataFrame.from_records(res)
    return res

With this, converting any rpy2 object to a non-rpy2 object will return a pandas Categorical whenever there is an R factor:

robjects.conversion.ri2py(pr) 

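You can then add the result of this last conversion to your data table.

Note that the conversion to non-rpy2 objects has to be explicit (one calls the converter). If you are using ipython, there is a way to make this implicit: https://gist.github.com/lgautier/e2e8709776e0e0e93b8d (and the original thread: https://bitbucket.org/rpy2/rpy2/issue/230/rmagic-specific-conversion).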