2017-05-09

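Loading a Word2Vec model in Spark

Is it possible to load a pre-trained (binary) model into Spark (using Scala)? I tried to load one of the binary models published by Google like this: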

import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}

val model = Word2VecModel.load(sc, "GoogleNews-vectors-negative300.bin")

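But it fails because it cannot find the metadata directory. I also created that folder and put the binary file inside it, but then the file could not be parsed. I have not found any wrapper that handles this.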

Answers


I wrote a quick function that loads the Google News pre-trained model into a Spark Word2VecModel. Enjoy.

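import java.io.{ByteArrayOutputStream, DataInputStream, FileInputStream}
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPInputStream

import org.apache.spark.mllib.feature.Word2VecModel

def loadBin(file: String): Word2VecModel = {
  // Read raw bytes up to (not including) the terminator and decode them as UTF-8,
  // so that non-ASCII words are not corrupted by a byte-to-char cast.
  def readUntil(inputStream: DataInputStream, term: Char, maxLength: Int = 1024 * 8): String = {
    val bytes = new ByteArrayOutputStream
    var next = inputStream.readByte()
    while (next != term.toByte) {
      bytes.write(next.toInt)
      assert(bytes.size() < maxLength)
      next = inputStream.readByte()
    }
    new String(bytes.toByteArray(), StandardCharsets.UTF_8)
  }
  // Assumes a gzip-compressed file (e.g. GoogleNews-vectors-negative300.bin.gz);
  // for an uncompressed .bin file, drop the GZIPInputStream wrapper.
  val inputStream = new DataInputStream(new GZIPInputStream(new FileInputStream(file)))
  try {
    // Header line: "<number of words> <vector size>\n".
    val header = readUntil(inputStream, '\n')
    val (records, dimensions) = header.split(" ") match {
      case Array(records, dimensions) => (records.toInt, dimensions.toInt)
    }
    // All vectors are held in driver memory, so size the driver heap accordingly.
    new Word2VecModel((0 until records).toArray.map { _ =>
      // word2vec.c writes '\n' after each vector; drop it if it precedes the next word.
      readUntil(inputStream, ' ').stripPrefix("\n") -> (0 until dimensions).map { _ =>
        // Floats are stored little-endian, while DataInputStream reads big-endian ints.
        java.lang.Float.intBitsToFloat(java.lang.Integer.reverseBytes(inputStream.readInt()))
      }.toArray
    }.toMap)
  } finally {
    inputStream.close()
  }
}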
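
To make the byte layout that loadBin relies on easier to see, here is a small sketch that writes a toy vocabulary in the same layout and reads it back. It is only an illustration under stated assumptions: the object name Word2VecBinFormatSketch and the helpers encode and decode are hypothetical, the bytes are kept in memory rather than gzipped to disk, and Spark is not involved at all.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream}
import java.nio.{ByteBuffer, ByteOrder}
import java.nio.charset.StandardCharsets

// Hypothetical helper object, for illustration only.
object Word2VecBinFormatSketch {

  // Write a header "<words> <dimensions>\n", then for each word its text,
  // one space, and its vector as little-endian 32-bit floats.
  def encode(vectors: Seq[(String, Array[Float])]): Array[Byte] = {
    val out = new ByteArrayOutputStream
    val dims = vectors.head._2.length
    out.write(s"${vectors.size} $dims\n".getBytes(StandardCharsets.UTF_8))
    for ((word, vec) <- vectors) {
      out.write(s"$word ".getBytes(StandardCharsets.UTF_8))
      val buf = ByteBuffer.allocate(4 * vec.length).order(ByteOrder.LITTLE_ENDIAN)
      vec.foreach(f => buf.putFloat(f))
      out.write(buf.array())
    }
    out.toByteArray()
  }

  // Parse the same layout back, using the byte-swapping trick from loadBin.
  def decode(bytes: Array[Byte]): Seq[(String, Seq[Float])] = {
    val in = new DataInputStream(new ByteArrayInputStream(bytes))

    // Collect bytes up to (not including) the terminator and decode as UTF-8.
    def readUntil(term: Char): String = {
      val buf = new ByteArrayOutputStream
      var b = in.readByte()
      while (b != term.toByte) {
        buf.write(b.toInt)
        b = in.readByte()
      }
      new String(buf.toByteArray(), StandardCharsets.UTF_8)
    }

    val header = readUntil('\n').split(" ")
    val (words, dims) = (header(0).toInt, header(1).toInt)
    (0 until words).map { _ =>
      val word = readUntil(' ')
      val vec = (0 until dims).map { _ =>
        // readInt() is big-endian, so reverse the bytes before reinterpreting.
        java.lang.Float.intBitsToFloat(java.lang.Integer.reverseBytes(in.readInt()))
      }
      word -> vec
    }
  }

  def main(args: Array[String]): Unit = {
    val toy = Seq("cat" -> Array(1.0f, -2.5f), "café" -> Array(0.25f, 3.0f))
    decode(encode(toy)).foreach { case (word, vec) =>
      println(s"$word -> ${vec.mkString(", ")}")
    }
  }
}
```

The toy vocabulary includes a non-ASCII word on purpose: reading each byte and casting it straight to a Char would mangle it, which is why the bytes are collected first and decoded as UTF-8 afterwards.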