Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector

Given my pyspark Row object:

>>> row 
Row(clicked=0, features=SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752})) 
>>> row.clicked 
0 
>>> row.features 
SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752}) 
>>> type(row.features) 
<class 'pyspark.ml.linalg.SparseVector'>

However, row.features fails the isinstance(row.features, Vector) test.

>>> isinstance(SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752}), Vector) 
True 
>>> isinstance(row.features, Vector) 
False 
>>> isinstance(deepcopy(row.features), Vector) 
False

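This strange error has caused me a lot of trouble. Unless isinstance(row.features, Vector) passes, I cannot use the map function to generate LabeledPoint objects. I would be very grateful if someone could solve this problem.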

Source

2016-12-10 Jack Lv

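This is unlikely to be a bug. You didn't provide the code required to reproduce the issue, but most likely you are using Spark 2.0 with ML transformers and are comparing against the wrong class.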
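The mechanism behind a failing isinstance check like this can be reproduced without Spark. The sketch below is an illustration only: `ml_linalg` and `mllib_linalg` are hypothetical stand-in namespaces built by a made-up helper `make_namespace`, not the real pyspark modules. It shows that two classes sharing the name Vector are unrelated when they are defined in different places.

```python
from types import SimpleNamespace


def make_namespace(module_name):
    """Build a stand-in 'module' holding its own Vector and SparseVector classes."""
    Vector = type("Vector", (object,), {"__module__": module_name})
    SparseVector = type("SparseVector", (Vector,), {"__module__": module_name})
    return SimpleNamespace(Vector=Vector, SparseVector=SparseVector)


# Hypothetical stand-ins; the names only mimic the two pyspark packages.
ml_linalg = make_namespace("ml.linalg")
mllib_linalg = make_namespace("mllib.linalg")

features = ml_linalg.SparseVector()

# Same class name, but only the family the object was built from matches.
same_family = isinstance(features, ml_linalg.Vector)
other_family = isinstance(features, mllib_linalg.Vector)
print(same_family, other_family)
```

Here `same_family` is True and `other_family` is False, which mirrors the True/False pair seen in the question.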

Let's illustrate this with an example. Simple data:

from pyspark.ml.feature import OneHotEncoder 

row = OneHotEncoder(inputCol="x", outputCol="features").transform(
    sc.parallelize([(1.0,)]).toDF(["x"]) 
).first()

Now let's import the different vector classes:

from pyspark.ml.linalg import Vector as MLVector, Vectors as MLVectors 
from pyspark.mllib.linalg import Vector as MLLibVector, Vectors as MLLibVectors 
from pyspark.mllib.regression import LabeledPoint

and test:

isinstance(row.features, MLLibVector)

False

isinstance(row.features, MLVector)

True

As you can see, what we have is a pyspark.ml.linalg.Vector, not a pyspark.mllib.linalg.Vector, and it is not compatible with the old API:

LabeledPoint(0.0, row.features)

TypeError         Traceback (most recent call last) 
... 
TypeError: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector
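One quick way to find out which family a value belongs to is to look at the module its class was defined in, which is the same information the traceback prints. The helper names `vector_family` and `is_from_package` below are made up for this sketch, and `FakeMLSparseVector` is a stand-in class whose `__module__` is set by hand so the example runs without Spark.

```python
def vector_family(value):
    """Return the module path of value's class, e.g. 'pyspark.ml.linalg'."""
    return type(value).__module__


def is_from_package(value, package):
    """True if value's class is defined in `package` or one of its submodules.

    The trailing dot in the prefix check keeps 'pyspark.ml' from
    accidentally matching 'pyspark.mllib'.
    """
    module = vector_family(value)
    return module == package or module.startswith(package + ".")


# Stand-in for illustration only: mimics the module path shown in the traceback.
FakeMLSparseVector = type(
    "SparseVector", (object,), {"__module__": "pyspark.ml.linalg"}
)

sample = FakeMLSparseVector()
family = vector_family(sample)
from_ml = is_from_package(sample, "pyspark.ml")
from_mllib = is_from_package(sample, "pyspark.mllib")
print(family, from_ml, from_mllib)
```

On a real row, calling `vector_family(row.features)` should give the same module path that appears in the error message.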

You can convert the ML object to an MLLib one:

from pyspark.ml import linalg as ml_linalg 

def as_mllib(v): 
    if isinstance(v, ml_linalg.SparseVector): 
        return MLLibVectors.sparse(v.size, v.indices, v.values) 
    elif isinstance(v, ml_linalg.DenseVector): 
        return MLLibVectors.dense(v.toArray()) 
    else: 
        raise TypeError("Unsupported type: {0}".format(type(v))) 

LabeledPoint(0, as_mllib(row.features))

LabeledPoint(0.0, (1,[],[]))

Or simply:

LabeledPoint(0, MLLibVectors.fromML(row.features))

LabeledPoint(0.0, (1,[],[]))

But in general you should avoid situations where this is necessary.
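The conversion pattern used in this answer (dispatch on the concrete type, then rebuild the vector in the other family from its size, indices and values) can be sketched without Spark. All four classes below are hypothetical stand-ins, and `to_old` is a made-up name; this is not the pyspark implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NewSparse:  # stand-in for a sparse vector from the newer package
    size: int
    indices: List[int]
    values: List[float]


@dataclass
class NewDense:  # stand-in for a dense vector from the newer package
    values: List[float]


@dataclass
class OldSparse:  # stand-in for the older package's sparse vector
    size: int
    indices: List[int]
    values: List[float]


@dataclass
class OldDense:  # stand-in for the older package's dense vector
    values: List[float]


def to_old(v):
    """Rebuild a 'new' vector as its 'old' counterpart, copying the data."""
    if isinstance(v, NewSparse):
        return OldSparse(v.size, list(v.indices), list(v.values))
    if isinstance(v, NewDense):
        return OldDense(list(v.values))
    raise TypeError("Unsupported type: {0}".format(type(v)))


converted = to_old(NewSparse(7, [0, 3, 6], [1.0, 1.0, 0.752]))
print(converted)
```

Raising TypeError for anything unrecognized keeps the failure explicit, in the same spirit as the error LabeledPoint raises.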

Source

2016-12-10 13:52:51 user6910411

Thank you very much. That solved it! I really appreciate it!

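If you just want to convert SparseVectors from pyspark.ml into pyspark.mllib SparseVectors, you can use MLUtils. Say df is your DataFrame and the column holding the SparseVectors is named "features". Then the following lines do the job: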

from pyspark.mllib.utils import MLUtils 
df = MLUtils.convertVectorColumnsFromML(df, "features")

I ran into this problem because, when using CountVectorizer from pyspark.ml.feature, I could not create LabeledPoints: they are incompatible with the SparseVector from pyspark.ml.

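I don't know why their latest documentation for CountVectorizer doesn't use the "new" SparseVector class. Since the classification algorithms need LabeledPoints, this makes no sense to me...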

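UPDATE: I had misunderstood. The ml library is designed for DataFrame objects, while the mllib library is designed for RDD objects. Since Spark 2.0 the DataFrame data structure is recommended, because SparkSession is more compatible than SparkContext (though it stores a SparkContext object) and delivers DataFrames instead of RDDs. I found this post, which gave me the "aha" effect: mllib and ml. Thanks Alberto Bonsanto :).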

To use, for example, NaiveBayes from mllib, I had to convert my DataFrame into LabeledPoint objects for the mllib NaiveBayes.

But using NaiveBayes from ml is easier, because you don't need LabeledPoints; instead you specify the feature and label columns of the DataFrame.

PS: I struggled with this problem for hours, so I felt I needed to post it here :)

Source

2017-01-26 20:41:21 Matze
