Error training a logistic regression model on Apache Spark: SPARK-5063

I want to build a logistic regression model with Apache Spark. Here is the code.

from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint

parsedData = raw_data.map(mapper)  # mapper turns each record into a LabeledPoint (label plus feature vector)
featureVectors = parsedData.map(lambda point: point.features)  # extract the feature vectors from the parsed data
scaler = StandardScaler(True, True).fit(featureVectors)  # fit a standardization model (withMean=True, withStd=True)
scaledData = parsedData.map(lambda lp: LabeledPoint(lp.label, scaler.transform(lp.features)))  # scale features to zero mean and unit standard deviation
modelScaledSGD = LogisticRegressionWithSGD.train(scaledData, iterations=10)

But I get this error:

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

I don't know how to solve this. Any help would be greatly appreciated.

Source

2015-08-25 ashishsjsu

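The problem you are seeing is very similar to the one I described in "How to use Java/Scala function from an action or a transformation?". Calling scaler.transform inside map means a Scala function has to be invoked from the workers, and that call needs access to the SparkContext, hence the error you see.

The standard way to handle this is to process only the required part of the data on the RDD as a whole and then zip the results.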

labels = parsedData.map(lambda point: point.label) 
featuresTransformed = scaler.transform(featureVectors) 

scaledData = (labels 
    .zip(featuresTransformed) 
    .map(lambda p: LabeledPoint(p[0], p[1]))) 

modelScaledSGD = LogisticRegressionWithSGD.train(...)
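
To make the zip pattern above concrete, here is a minimal end-to-end sketch. The helper names `parse_line` and `scale_and_train` and the comma-separated `label,f1,f2,...` input format are assumptions for illustration; they are not part of the original question. It keeps `LogisticRegressionWithSGD` from the question, which has been deprecated since Spark 2.0.

```python
def parse_line(line):
    """Split 'label,f1,f2,...' into (label, [features]) using plain Python types."""
    values = [float(x) for x in line.split(",")]
    return values[0], values[1:]


def scale_and_train(sc, lines, iterations=10):
    """Standardize features on whole RDDs, zip the labels back, then train."""
    # Local imports so that parse_line can be used without Spark installed.
    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.mllib.feature import StandardScaler
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint

    parsed = sc.parallelize(lines).map(parse_line).cache()
    labels = parsed.map(lambda p: p[0])
    features = parsed.map(lambda p: Vectors.dense(p[1]))

    # fit() and transform() are both called on the driver with an RDD argument,
    # so no worker closure captures the JVM-backed scaler model.
    scaler = StandardScaler(withMean=True, withStd=True).fit(features)
    scaled = labels.zip(scaler.transform(features)).map(
        lambda p: LabeledPoint(p[0], p[1]))
    return LogisticRegressionWithSGD.train(scaled, iterations=iterations)
```

With a live SparkContext `sc`, something like `scale_and_train(sc, ["0,1.0,10.0", "1,2.0,30.0", "0,1.5,12.0", "1,3.0,28.0"])` should return a trained model. Note that `zip` expects both RDDs to have the same number of partitions and the same number of elements per partition, which holds here because both sides are derived from the same parent RDD with `map`.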

If you don't plan to implement your own methods on top of MLlib components, it may be easier to use the high-level ML API.
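
As a rough illustration of what the high-level API route could look like, the sketch below chains a scaler and a logistic regression in a `pyspark.ml` Pipeline. The function name `build_pipeline` and the column names `features`, `scaledFeatures`, and `label` are assumptions, and the input is assumed to be a DataFrame rather than an RDD of LabeledPoint.

```python
def build_pipeline(max_iter=10):
    """Return an unfitted Pipeline: standardize `features`, then logistic regression."""
    # Local imports: constructing these stages needs an active Spark session.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import StandardScaler

    # Assumed column names; adjust them to match your DataFrame.
    scaler = StandardScaler(withMean=True, withStd=True,
                            inputCol="features", outputCol="scaledFeatures")
    lr = LogisticRegression(featuresCol="scaledFeatures", labelCol="label",
                            maxIter=max_iter)
    return Pipeline(stages=[scaler, lr])
```

Given a DataFrame `df` with those columns, `build_pipeline().fit(df)` would fit both stages in order, with the scaling handled inside the pipeline instead of inside a user-written `map`.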

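Edit:

There are two possible problems here.

At this point LogisticRegressionWithSGD supports only binomial classification (thanks to eliasah for pointing that out). If you need multiclass classification, you can replace it with LogisticRegressionWithLBFGS.
StandardScaler supports only dense vectors, so its applications are limited.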
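
To see why centering does not play well with sparse input, here is a plain-Python sketch of what standardizing with both mean and std enabled computes per column. The function name `standardize` is hypothetical, and this is only an approximation of the idea, not MLlib's implementation; it assumes the corrected sample standard deviation (dividing by n - 1), which is what the MLlib documentation describes.

```python
from statistics import mean, stdev


def standardize(rows):
    """Center each column to mean 0 and scale it to unit sample std."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # stdev() is the corrected sample standard deviation (divides by n - 1).
    # A constant column has std 0; fall back to 1.0 to avoid dividing by zero.
    stds = [stdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


rows = [[0.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
print(standardize(rows))
```

In the second column the original zero entry becomes non-zero after the mean is subtracted, so a sparse vector generally turns dense once it is centered.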

Source

2015-08-25 11:52:15 zero323

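It gives this [error](https://gist.github.com/eliasah/cc6287b4307123e5755a). I have never seen this error before. – eliasah

Works fine on 1.4.1. I'll download 1.3.1 later and check whether I can reproduce the issue. `StandardScaler` doesn't work with sparse data, but I don't think that is the problem here. – zero323

The solution sounds logical and correct to me, which is why I'm surprised by this error. – eliasah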
