Applying the LSH method with a sparse matrix instead of a dense matrix

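I am trying to apply LSH (https://github.com/soundcloud/cosine-lsh-join-spark) to compute the cosine similarity of some vectors. My real data has 2M rows (documents) and 30K features, and the matrix is very sparse. As a sample, let's say my data looks like this: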

D1 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
D2 0 0 1 1 0 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 
D3 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 1 1 
D4 ...

In the example code, the features are put into a dense vector as follows:

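    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.IndexedRow
    import org.apache.spark.storage.StorageLevel

    val input = "text.txt"
    // numPartitions was not defined in the original snippet; 8 is an assumed value
    val numPartitions = 8
    val conf = new SparkConf()
      .setAppName("LSH-Cosine")
      .setMaster("local[4]")
    val storageLevel = StorageLevel.MEMORY_AND_DISK
    val sc = new SparkContext(conf)

    // read in an example data set of word embeddings
    val data = sc.textFile(input, numPartitions).map { line =>
      val split = line.split(" ")
      val word = split.head                       // first token is the row label
      val features = split.tail.map(_.toDouble)   // remaining tokens are the values
      (word, features)
    }

    // create a unique id for each word by zipping with the RDD index
    val indexed = data.zipWithIndex.persist(storageLevel)

    // create indexed row matrix where every row represents one word
    val rows = indexed.map {
      case ((word, features), index) =>
        IndexedRow(index, Vectors.dense(features))
    }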

What I want to do is use a sparse matrix instead of a dense one. How should I change `Vectors.dense(features)`?

Source

2016-02-01 mlee_jordan

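The equivalent factory method for sparse vectors is Vectors.sparse, which takes the vector size, an array of indices, and a corresponding array of values for the non-zero entries. The method signatures in the cosine-lsh-join-spark library are based on the generic Vector class, so it looks like the library will accept either sparse or dense vectors.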
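
As a minimal sketch of that suggestion, the helper below builds a sparse vector from the parsed `features` array by keeping only the non-zero entries. The name `toSparse` is hypothetical (it is not part of cosine-lsh-join-spark), and the sketch assumes the `org.apache.spark.mllib.linalg` vector types used in the question's snippet.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical helper, not part of cosine-lsh-join-spark: converts a dense
// array of parsed values into a sparse MLlib vector.
def toSparse(features: Array[Double]): Vector = {
  // Pair each value with its position and keep only the non-zero entries.
  // zipWithIndex preserves order, so the indices stay strictly increasing,
  // which Vectors.sparse requires.
  val nonZero = features.zipWithIndex.filter { case (value, _) => value != 0.0 }
  val indices = nonZero.map { case (_, index) => index }
  val values = nonZero.map { case (value, _) => value }
  // The first argument is the full dimension of the vector, zeros included.
  Vectors.sparse(features.length, indices, values)
}
```

In the question's code, this would mean replacing `Vectors.dense(features)` with `toSparse(features)` inside the `IndexedRow(...)` call. With 30K features and very sparse rows, it may be worth avoiding the intermediate dense array entirely, but that depends on the input file format.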

Source

2016-02-25 18:22:10
