Generating PMML for a text classification pipeline in Python

I am trying to generate PMML (using jpmml-sklearn) for a text classification pipeline. The last line of the code, sklearn2pmml(Textpipeline, "TextMiningClassifier.pmml", with_repr = True), crashes.

from sklearn.datasets import fetch_20newsgroups 
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.linear_model import SGDClassifier 
from sklearn2pmml import PMMLPipeline 

categories = [ 
'alt.atheism', 
'talk.religion.misc', 
] 

print("Loading 20 newsgroups dataset for categories:") 
print(categories) 
data = fetch_20newsgroups(subset='train', categories=categories) 
print("%d documents" % len(data.filenames)) 
print("%d categories" % len(data.target_names)) 

Textpipeline = PMMLPipeline([ 
('vect', CountVectorizer()), 
('tfidf', TfidfTransformer()), 
('clf', SGDClassifier()), 
]) 

Textpipeline.fit(data.data, data.target) 

from sklearn2pmml import sklearn2pmml 

sklearn2pmml(Textpipeline, "TextMiningClassifier.pmml", with_repr = True)

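It looks like sklearn2pmml() cannot take Textpipeline as input. The code works for other pipelines (examples here: https://github.com/jpmml/sklearn2pmml), but not for the text classification pipeline above. So my question is: how do I generate PMML for a text classification problem?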

The error I get:

Jun 15, 2017 12:48:00 PM org.jpmml.sklearn.Main run 
INFO: Parsing PKL.. 
Jun 15, 2017 12:48:01 PM org.jpmml.sklearn.Main run 
INFO: Parsed PKL in 489 ms. 
Jun 15, 2017 12:48:01 PM org.jpmml.sklearn.Main run 
INFO: Converting.. 
Jun 15, 2017 12:48:01 PM sklearn2pmml.PMMLPipeline encodePMML 
WARNING: The 'target_field' attribute is not set. Assuming y as the name of the target field 
Jun 15, 2017 12:48:01 PM sklearn2pmml.PMMLPipeline initFeatures 
WARNING: The 'active_fields' attribute is not set. Assuming [x1] as the names of active fields 
Jun 15, 2017 12:48:01 PM org.jpmml.sklearn.Main run 
SEVERE: Failed to convert 
java.lang.IllegalArgumentException: The tokenizer object (null) is not Splitter 
at sklearn.feature_extraction.text.CountVectorizer.getTokenizer(CountVectorizer.java:263) 
at sklearn.feature_extraction.text.CountVectorizer.encodeDefineFunction(CountVectorizer.java:164) 
at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:124) 
at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:93) 
at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:122) 
at org.jpmml.sklearn.Main.run(Main.java:144) 
at org.jpmml.sklearn.Main.main(Main.java:93) 

Exception in thread "main" java.lang.IllegalArgumentException: The tokenizer object (null) is not Splitter 
at sklearn.feature_extraction.text.CountVectorizer.getTokenizer(CountVectorizer.java:263) 
at sklearn.feature_extraction.text.CountVectorizer.encodeDefineFunction(CountVectorizer.java:164) 
at sklearn.feature_extraction.text.CountVectorizer.encodeFeatures(CountVectorizer.java:124) 
at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:93) 
at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:122) 
at org.jpmml.sklearn.Main.run(Main.java:144) 
at org.jpmml.sklearn.Main.main(Main.java:93) 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "C:\Data\Anaconda2\lib\site-packages\sklearn2pmml\__init__.py", line 142, in sklearn2pmml 
raise RuntimeError("The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams") 
RuntimeError: The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams

Source

2017-06-15 Nikhil Garge

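You need to use a PMML-compatible text tokenization function. The stack trace shows why: the default CountVectorizer has no tokenizer object set (it is null), and the converter requires one it can translate. The default implementation is the sklearn2pmml.feature_extraction.text.Splitter class: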

from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn2pmml.feature_extraction.text import Splitter 
vectorizer = TfidfVectorizer(analyzer = "word", token_pattern = None, tokenizer = Splitter())

More details, with a reference to the JPMML mailing list, are available here: https://groups.google.com/forum/#!topic/jpmml/wi-0rxzUn1o
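
One side effect worth keeping in mind: replacing the default regex tokenization with a splitter changes which tokens end up in the vocabulary, so the retrained model may not match the original one exactly. The standard-library sketch below only illustrates that difference. The regex is scikit-learn's default token_pattern, while whitespace_tokens is a hypothetical stand-in that splits on runs of whitespace. It is not the Splitter implementation, which may treat punctuation differently.

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer / TfidfVectorizer:
# words of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def regex_tokens(text):
    """Tokens selected by the default regex, after lowercasing."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text.lower())


def whitespace_tokens(text):
    """Hypothetical stand-in: split on runs of whitespace, after lowercasing."""
    return [tok for tok in re.split(r"\s+", text.lower()) if tok]


sample = "Is PMML export possible? I think so."
print(regex_tokens(sample))
# ['is', 'pmml', 'export', 'possible', 'think', 'so']
print(whitespace_tokens(sample))
# ['is', 'pmml', 'export', 'possible?', 'i', 'think', 'so.']
```

The regex drops the single-character token and the punctuation, whereas plain whitespace splitting keeps both, so it is worth re-checking accuracy after switching tokenizers.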

Source

2017-06-15 08:12:56 user1808924
