Spark MLlib - Creating LabeledPoint from RDD[Vector] features and RDD[Vector] labels

I am building a training set from two text files, one containing the documents and one containing the labels.

Documents.txt

hello world 
hello mars

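Labels.txt

0
1

I have read in these files and converted my document data to a tf-idf weighted term-document matrix, which is represented as an RDD[Vector]. I have also read in the labels and created an RDD[Vector] for them: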

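import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// One document per line, tokenized on spaces
val docs: RDD[Seq[String]] = sc.textFile("Documents.txt").map(_.split(" ").toSeq)

// One label vector per line
val labs: RDD[Vector] = sc.textFile("Labels.txt")
  .map(s => Vectors.dense(s.split(',').map(_.toDouble)))

// Term frequencies, then tf-idf weighting
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(docs)
tf.cache()
val idf = new IDF(minDocFreq = 3).fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)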

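I would like to use tfidf and labs to create an RDD[LabeledPoint], but I am not sure how to apply a mapping across two different RDDs. Is this even possible or efficient, or do I need to rethink my approach?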

Source

2016-03-11 Brian Vanover

You should 'join' the two 'RDD's.

@AlbertoBonsanto I was considering that approach, but how can I do it if neither 'RDD' has 'keys' to 'join' on?

One way to handle this is to join based on indices:

import org.apache.spark.RangePartitioner

// Add indices so that both RDDs are keyed by element position
val tfidfIndexed = tfidf.zipWithIndex.map(_.swap)
val labsIndexed = labs.zipWithIndex.map(_.swap)

// Create a range partitioner on the larger RDD
val partitioner = new RangePartitioner(tfidfIndexed.partitions.size, tfidfIndexed)

// Join with the custom partitioner; yields (label vector, feature vector) pairs
labsIndexed.join(tfidfIndexed, partitioner).values

Source

2016-03-11 22:10:22 zero323
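
The index-based join above yields pairs of vectors, so one more step is needed to reach the RDD[LabeledPoint] the question asks for. The following is a minimal sketch of that step, not part of the original answer. It assumes each label vector holds exactly one value, so its first element is used as the label. The object name LabeledPointJoinSketch, the helper toLabeledPoints, the in-memory sample data and the local[2] master are all hypothetical and chosen only for illustration. It also uses the default partitioner instead of the RangePartitioner, to keep the example short.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

object LabeledPointJoinSketch {

  // Hypothetical helper: pair features and labels by position,
  // then turn each pair into a LabeledPoint.
  def toLabeledPoints(features: RDD[Vector], labels: RDD[Vector]): RDD[LabeledPoint] = {
    val featuresIndexed = features.zipWithIndex.map(_.swap) // (index, feature vector)
    val labelsIndexed = labels.zipWithIndex.map(_.swap)     // (index, label vector)

    labelsIndexed.join(featuresIndexed)
      .sortByKey() // restore the original row order (optional for training)
      .values
      // Assumption: a single-valued label vector, so element 0 is the label
      .map { case (label, feats) => LabeledPoint(label(0), feats) }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("labeled-point-sketch")
    val sc = new SparkContext(conf)
    try {
      // Stand-ins for the tf-idf vectors and the parsed labels
      val features = sc.parallelize(Seq(
        Vectors.dense(Array(1.0, 0.0)),
        Vectors.dense(Array(0.0, 1.0))))
      val labels = sc.parallelize(Seq(
        Vectors.dense(Array(0.0)),
        Vectors.dense(Array(1.0))))

      toLabeledPoints(features, labels).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

The pairing is only meaningful if both RDDs list their rows in the same order, which should hold when each is derived from its text file through element-wise transformations such as map. An inner join silently drops any index that is missing on one side, so comparing the two counts beforehand is a cheap sanity check.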
