2015-10-19 330 views
8

我想將python scikit-learn模型導出爲PMML。將python scikit學習模型導入pmml

什麼python軟件包最適合?

我讀了約Augustus,但我無法找到任何使用scikit-learn模型的例子。

+0

您可以轉換使用[sklearn2pmml] Scikit,學習模式和變壓器PMML(https://github.com/jpmml/sklearn2pmml)包。 – user1808924

+0

JPMML-SkLearn也支持Python 2.7,但目前尚未公佈。 – user1808924

+0

jpmml-sklearn包支持python 3.4。有沒有一種替代方案支持python 2.7 – Selva

回答

8

SkLearn2PMML

圍繞JPMML-SkLearn命令行應用程序的薄包裝。有關支持的Scikit-Learn Estimator和Transformer類型的列表,請參閱JPMML-SkLearn項目的文檔。

正如@ user1808924所說,它支持Python 2.7或3.4+。它也需要Java 1.7+

通過安裝:(需要git)如何分類樹導出到PMML

pip install git+https://github.com/jpmml/sklearn2pmml.git 

例。首先,蘋果樹在成長:

# example tree & viz from http://scikit-learn.org/stable/modules/tree.html 
from sklearn import datasets, tree 
iris = datasets.load_iris() 
clf = tree.DecisionTreeClassifier() 
clf = clf.fit(iris.data, iris.target) 

有兩個部分的SkLearn2PMML轉換,一個估計(我們clf)和映射器(用於預處理步驟,如離散或PCA)。我們的映射器是非常基本的,因爲我們沒有做任何轉換。

from sklearn_pandas import DataFrameMapper 
default_mapper = DataFrameMapper([(i, None) for i in iris.feature_names + ['Species']]) 

from sklearn2pmml import sklearn2pmml 
sklearn2pmml(estimator=clf, 
      mapper=default_mapper, 
      pmml="D:/workspace/IrisClassificationTree.pmml") 

這是可能的(雖然不是記錄)通過mapper=None,但你會看到預測名迷路(返回x1沒有sepal length等)。

讓我們看看.pmml文件:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?> 
<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3"> 
    <Header> 
     <Application name="JPMML-SkLearn" version="1.1.1"/> 
     <Timestamp>2016-09-26T19:21:43Z</Timestamp> 
    </Header> 
    <DataDictionary> 
     <DataField name="sepal length (cm)" optype="continuous" dataType="float"/> 
     <DataField name="sepal width (cm)" optype="continuous" dataType="float"/> 
     <DataField name="petal length (cm)" optype="continuous" dataType="float"/> 
     <DataField name="petal width (cm)" optype="continuous" dataType="float"/> 
     <DataField name="Species" optype="categorical" dataType="string"> 
      <Value value="setosa"/> 
      <Value value="versicolor"/> 
      <Value value="virginica"/> 
     </DataField> 
    </DataDictionary> 
    <TreeModel functionName="classification" splitCharacteristic="binarySplit"> 
     <MiningSchema> 
      <MiningField name="Species" usageType="target"/> 
      <MiningField name="sepal length (cm)"/> 
      <MiningField name="sepal width (cm)"/> 
      <MiningField name="petal length (cm)"/> 
      <MiningField name="petal width (cm)"/> 
     </MiningSchema> 
     <Output> 
      <OutputField name="probability_setosa" dataType="double" feature="probability" value="setosa"/> 
      <OutputField name="probability_versicolor" dataType="double" feature="probability" value="versicolor"/> 
      <OutputField name="probability_virginica" dataType="double" feature="probability" value="virginica"/> 
     </Output> 
     <Node id="1"> 
      <True/> 
      <Node id="2" score="setosa" recordCount="50.0"> 
       <SimplePredicate field="petal width (cm)" operator="lessOrEqual" value="0.8"/> 
       <ScoreDistribution value="setosa" recordCount="50.0"/> 
       <ScoreDistribution value="versicolor" recordCount="0.0"/> 
       <ScoreDistribution value="virginica" recordCount="0.0"/> 
      </Node> 
      <Node id="3"> 
       <SimplePredicate field="petal width (cm)" operator="greaterThan" value="0.8"/> 
       <Node id="4"> 
        <SimplePredicate field="petal width (cm)" operator="lessOrEqual" value="1.75"/> 
        <Node id="5"> 
         <SimplePredicate field="petal length (cm)" operator="lessOrEqual" value="4.95"/> 
         <Node id="6" score="versicolor" recordCount="47.0"> 
          <SimplePredicate field="petal width (cm)" operator="lessOrEqual" value="1.6500001"/> 
          <ScoreDistribution value="setosa" recordCount="0.0"/> 
          <ScoreDistribution value="versicolor" recordCount="47.0"/> 
          <ScoreDistribution value="virginica" recordCount="0.0"/> 
         </Node> 
         <Node id="7" score="virginica" recordCount="1.0"> 
          <SimplePredicate field="petal width (cm)" operator="greaterThan" value="1.6500001"/> 
          <ScoreDistribution value="setosa" recordCount="0.0"/> 
          <ScoreDistribution value="versicolor" recordCount="0.0"/> 
          <ScoreDistribution value="virginica" recordCount="1.0"/> 
         </Node> 
        </Node> 
        <Node id="8"> 
         <SimplePredicate field="petal length (cm)" operator="greaterThan" value="4.95"/> 
         <Node id="9" score="virginica" recordCount="3.0"> 
          <SimplePredicate field="petal width (cm)" operator="lessOrEqual" value="1.55"/> 
          <ScoreDistribution value="setosa" recordCount="0.0"/> 
          <ScoreDistribution value="versicolor" recordCount="0.0"/> 
          <ScoreDistribution value="virginica" recordCount="3.0"/> 
         </Node> 
         <Node id="10"> 
          <SimplePredicate field="petal width (cm)" operator="greaterThan" value="1.55"/> 
          <Node id="11" score="versicolor" recordCount="2.0"> 
           <SimplePredicate field="sepal length (cm)" operator="lessOrEqual" value="6.95"/> 
           <ScoreDistribution value="setosa" recordCount="0.0"/> 
           <ScoreDistribution value="versicolor" recordCount="2.0"/> 
           <ScoreDistribution value="virginica" recordCount="0.0"/> 
          </Node> 
          <Node id="12" score="virginica" recordCount="1.0"> 
           <SimplePredicate field="sepal length (cm)" operator="greaterThan" value="6.95"/> 
           <ScoreDistribution value="setosa" recordCount="0.0"/> 
           <ScoreDistribution value="versicolor" recordCount="0.0"/> 
           <ScoreDistribution value="virginica" recordCount="1.0"/> 
          </Node> 
         </Node> 
        </Node> 
       </Node> 
       <Node id="13"> 
        <SimplePredicate field="petal width (cm)" operator="greaterThan" value="1.75"/> 
        <Node id="14"> 
         <SimplePredicate field="petal length (cm)" operator="lessOrEqual" value="4.8500004"/> 
         <Node id="15" score="virginica" recordCount="2.0"> 
          <SimplePredicate field="sepal width (cm)" operator="lessOrEqual" value="3.1"/> 
          <ScoreDistribution value="setosa" recordCount="0.0"/> 
          <ScoreDistribution value="versicolor" recordCount="0.0"/> 
          <ScoreDistribution value="virginica" recordCount="2.0"/> 
         </Node> 
         <Node id="16" score="versicolor" recordCount="1.0"> 
          <SimplePredicate field="sepal width (cm)" operator="greaterThan" value="3.1"/> 
          <ScoreDistribution value="setosa" recordCount="0.0"/> 
          <ScoreDistribution value="versicolor" recordCount="1.0"/> 
          <ScoreDistribution value="virginica" recordCount="0.0"/> 
         </Node> 
        </Node> 
        <Node id="17" score="virginica" recordCount="43.0"> 
         <SimplePredicate field="petal length (cm)" operator="greaterThan" value="4.8500004"/> 
         <ScoreDistribution value="setosa" recordCount="0.0"/> 
         <ScoreDistribution value="versicolor" recordCount="0.0"/> 
         <ScoreDistribution value="virginica" recordCount="43.0"/> 
        </Node> 
       </Node> 
      </Node> 
     </Node> 
    </TreeModel> 
</PMML> 

第一分割(節點1)是花瓣寬度在0.8。節點2(花瓣寬度< = 0.8)捕獲所有的setosa,沒有別的。

您可以比較PMML輸出到graphviz輸出:

from sklearn.externals.six import StringIO 
import pydotplus # this might be pydot for python 2.7 
dot_data = StringIO() 
tree.export_graphviz(clf, 
        out_file=dot_data, 
        feature_names=iris.feature_names, 
        class_names=iris.target_names, 
        filled=True, rounded=True, 
        special_characters=True) 
graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) 
graph.write_pdf("D:/workspace/iris.pdf") 
# for in-line display, you can also do: 
# from IPython.display import Image 
# Image(graph.create_png()) 

enter image description here

+0

有沒有一種方法可以在不使用映射器時保留預測變量名稱?我真的需要在評估者方面瞭解他們,但僅僅爲此構建一個映射器太過於過分了。 – KidCrippler

+0

@K我無法弄清楚如何保留沒有映射器的預測變量名稱。您可以嘗試發佈問題。 – C8H10N4O2

+3

答案似乎已過時:'sklearn2pmml'現在使用'PMMLPipeline'。 – sds