This is unlikely to be a bug. You didn't provide the code required to reproduce the issue, but most likely you are using Spark 2.0 with ML transformers and comparing the wrong entities.
Let's illustrate with an example. First, some simple data:
from pyspark.ml.feature import OneHotEncoder

row = OneHotEncoder(inputCol="x", outputCol="features").transform(
    sc.parallelize([(1.0,)]).toDF(["x"])
).first()
Now let's import the different vector classes:
from pyspark.ml.linalg import Vector as MLVector, Vectors as MLVectors
from pyspark.mllib.linalg import Vector as MLLibVector, Vectors as MLLibVectors
from pyspark.mllib.regression import LabeledPoint
and run some tests:
isinstance(row.features, MLLibVector)
False
isinstance(row.features, MLVector)
True
As you can see, what we have is a pyspark.ml.linalg.Vector, not a pyspark.mllib.linalg.Vector, and it is not compatible with the old API:
LabeledPoint(0.0, row.features)
TypeError Traceback (most recent call last)
...
TypeError: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector
You can convert the ML object to an MLLib one:
from pyspark.ml import linalg as ml_linalg

def as_mllib(v):
    if isinstance(v, ml_linalg.SparseVector):
        return MLLibVectors.sparse(v.size, v.indices, v.values)
    elif isinstance(v, ml_linalg.DenseVector):
        return MLLibVectors.dense(v.toArray())
    else:
        raise TypeError("Unsupported type: {0}".format(type(v)))
LabeledPoint(0, as_mllib(row.features))
LabeledPoint(0.0, (1,[],[]))
or more simply:
LabeledPoint(0, MLLibVectors.fromML(row.features))
LabeledPoint(0.0, (1,[],[]))
But in general, you should avoid situations where this is necessary.
Thank you so much. That solved it! I really appreciate it! –