I am trying to apply LSH (https://github.com/soundcloud/cosine-lsh-join-spark) to compute the cosine similarity of some vectors, using a sparse matrix instead of a dense one. My real data has 2M rows (documents) and 30K attributes, and the matrix is very sparse. As a sample, say my data looks like this:
D1 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D2 0 0 1 1 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0
D3 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 1 1
D4 ...
In the relevant code, the features are all put into a dense vector as follows:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.IndexedRow
import org.apache.spark.storage.StorageLevel

val input = "text.txt"
val numPartitions = 4
val conf = new SparkConf()
  .setAppName("LSH-Cosine")
  .setMaster("local[4]")
val storageLevel = StorageLevel.MEMORY_AND_DISK
val sc = new SparkContext(conf)

// read in an example data set of word embeddings
val data = sc.textFile(input, numPartitions).map { line =>
  val split = line.split(" ")
  val word = split.head
  val features = split.tail.map(_.toDouble)
  (word, features)
}

// create a unique id for each word by zipping with the RDD index
val indexed = data.zipWithIndex.persist(storageLevel)

// create an indexed row matrix where every row represents one word
val rows = indexed.map { case ((word, features), index) =>
  IndexedRow(index, Vectors.dense(features))
}
What I want to do is use sparse vectors instead of dense ones. How should `Vectors.dense(features)` be adapted?
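One possible adaptation, sketched below: MLlib also provides `Vectors.sparse(size, indices, values)`, which takes the vector length plus the indices and values of only the non-zero entries. A helper (the name `toSparseParts` is mine, not from the original code) can split a parsed row into those two arrays:

```scala
// Split a dense feature array into the (indices, values) pair that
// MLlib's Vectors.sparse(size, indices, values) factory expects.
// Helper name is hypothetical, for illustration only.
def toSparseParts(features: Array[Double]): (Array[Int], Array[Double]) = {
  val nonZero = features.zipWithIndex.filter { case (v, _) => v != 0.0 }
  (nonZero.map(_._2), nonZero.map(_._1))
}

// Row D1 from the sample above: 23 attributes, only 5 non-zero.
val d1 = Array(1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0) ++ Array.fill(16)(0.0)
val (indices, values) = toSparseParts(d1)
println(indices.mkString(","))  // 0,1,3,5,6
println(values.mkString(","))   // 1.0,1.0,1.0,1.0,1.0
```

In the row mapper, the dense construction would then become something like `IndexedRow(index, Vectors.sparse(features.length, indices, values))`, so that only the non-zero entries are stored.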