2016-08-15

I am trying to use LogisticRegressionWithLBFGS with pySpark, and it throws an error about multinomial classification not being supported. Here is my code:

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint
from numpy import array


RES_DIR = "/home/shaahmed115/Pet_Projects/DA/TwitterStream_US_Elections/Features/"
sc = SparkContext('local', 'pyspark')

data_file = RES_DIR + "training.txt"
raw_data = sc.textFile(data_file)

print("Train data size is {}".format(raw_data.count()))


test_data_file = RES_DIR + "testing.txt"
test_raw_data = sc.textFile(test_data_file)

print("Test data size is {}".format(test_raw_data.count()))

def parse_interaction(line):
    # The first field is the class label; the remaining fields are the features.
    line_split = line.split(",")
    return LabeledPoint(float(line_split[0]),
                        array([float(x) for x in line_split[1:]]))

training_data = raw_data.map(parse_interaction)
logit_model = LogisticRegressionWithLBFGS.train(training_data, iterations=10, numClasses=3)

This throws an error when training the logistic regression:

Currently, LogisticRegression with ElasticNet in ML package only supports binary classification. Found 3 in the input dataset.

Below is a sample of my dataset:

2,1.0,1.0,1.0
0,1.0,1.0,1.0
1,0.0,0.0,0.0

The first element is the class, and the rest is the feature vector. As you can see, there are three classes. Is there a workaround to make multinomial classification work here?
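For reference, the label/feature split described above can be checked without Spark. A minimal sketch in plain Python (the helper name is illustrative, not from the original code):

```python
def parse_line(line):
    # First CSV field is the class label; the rest are feature values.
    parts = line.split(",")
    return float(parts[0]), [float(x) for x in parts[1:]]

label, features = parse_line("2,1.0,1.0,1.0")
print(label)     # 2.0
print(features)  # [1.0, 1.0, 1.0]
```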

Answer


The error you are seeing,

Currently, LogisticRegression with ElasticNet in ML package only supports binary classification.

is clear. You can use the mllib version, which supports multinomial classification:
org.apache.spark.mllib.classification.LogisticRegression

/** 
* Train a classification model for Multinomial/Binary Logistic Regression using 
* Limited-memory BFGS. Standard feature scaling and L2 regularization are used by default. 
* NOTE: Labels used in Logistic Regression should be {0, 1, ..., k - 1} 
* for k classes multi-label classification problem. 
* 
* Earlier implementations of LogisticRegressionWithLBFGS applies a regularization 
* penalty to all elements including the intercept. If this is called with one of 
* standard updaters (L1Updater, or SquaredL2Updater) this is translated 
* into a call to ml.LogisticRegression, otherwise this will use the existing mllib 
* GeneralizedLinearAlgorithm trainer, resulting in a regularization penalty to the 
* intercept. 
*/ 