2016-11-11 209 views
3

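PySpark: sparse vector to SciPy sparse matrix

I have a Spark DataFrame with a column of short sentences and a column holding a categorical variable. I would like to run tf-idf on the sentences and one-hot encoding on the categorical variable, then output the result as a sparse matrix on the driver once it is much smaller in size (for a scikit-learn model).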

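What is the best way to get the data out of Spark in sparse form? Sparse vectors seem to offer only a toArray() method, which returns a dense NumPy array. However, the docs do say that SciPy sparse arrays can be used in place of Spark sparse arrays.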

Keep in mind that the tf_idf values are in fact a column of sparse arrays. Ideally, all of these features would end up in one large sparse matrix.

Answer

5

One possible solution can be expressed as follows:

  • Convert the features to an RDD and extract the vectors:

    from pyspark.ml.linalg import SparseVector 
    from operator import attrgetter 
    
    df = sc.parallelize([ 
        (SparseVector(3, [0, 2], [1.0, 3.0]),), 
        (SparseVector(3, [1], [4.0]),) 
    ]).toDF(["features"]) 
    
    features = df.rdd.map(attrgetter("features")) 
    
  • Add row indices:

    indexed_features = features.zipWithIndex() 
    
  • Flatten to an RDD of (i, j, value) tuples:

    def explode(row):
        # row is a (vector, row_index) pair produced by zipWithIndex
        vec, i = row
        for j, v in zip(vec.indices, vec.values):
            yield i, j, v
    
    entries = indexed_features.flatMap(explode) 
    
  • Collect and reshape:

    row_indices, col_indices, data = zip(*entries.collect()) 
    
  • Compute the shape:

    shape = (
        df.count(), 
        df.rdd.map(attrgetter("features")).first().size 
    ) 
    
  • Create the sparse matrix:

    from scipy.sparse import csr_matrix 
    
    mat = csr_matrix((data, (row_indices, col_indices)), shape=shape) 
    
  • Quick sanity check:

    mat.todense() 
    

    With the expected result:

    matrix([[ 1., 0., 3.], 
         [ 0., 4., 0.]]) 
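One caveat with the collect step above: if no vector has a non-zero entry, `zip(*entries.collect())` has nothing to unpack and the three-way assignment fails. Below is a minimal sketch of a guard for that case. It runs without Spark: `Vec` is a hypothetical stand-in for `SparseVector` that exposes only `size`, `indices` and `values`, and `triples_to_csr` is an illustrative helper name, not part of any library.

```python
from collections import namedtuple

import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-in for pyspark.ml.linalg.SparseVector, exposing only the
# three attributes the steps above rely on, so the sketch runs without Spark.
Vec = namedtuple("Vec", ["size", "indices", "values"])


def triples_to_csr(triples, shape):
    """Build a CSR matrix from collected (i, j, value) triples."""
    if not triples:
        # Nothing to unzip: return an all-zero matrix of the requested shape.
        return csr_matrix(shape)
    rows, cols, data = zip(*triples)
    return csr_matrix((data, (rows, cols)), shape=shape)


# Local equivalent of zipWithIndex + flatMap(explode) + collect.
vectors = [
    Vec(3, np.array([0, 2]), np.array([1.0, 3.0])),
    Vec(3, np.array([1]), np.array([4.0])),
]
triples = [
    (i, int(j), float(v))
    for i, vec in enumerate(vectors)
    for j, v in zip(vec.indices, vec.values)
]

mat = triples_to_csr(triples, shape=(len(vectors), vectors[0].size))
empty = triples_to_csr([], shape=(2, 3))
```

With Spark, the same helper would presumably be called as `triples_to_csr(entries.collect(), shape)`.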
    

Another approach:

  • Convert each row of features to a single-row matrix:

    import numpy as np 
    
    def as_matrix(vec): 
        data, indices = vec.values, vec.indices 
        shape = 1, vec.size 
        return csr_matrix((data, indices, np.array([0, vec.values.size])), shape) 
    
    mats = features.map(as_matrix) 
    
  • and reduce with vstack:

    from scipy.sparse import vstack 
    
    mat = mats.reduce(lambda x, y: vstack([x, y])) 
    

    or collect and vstack:

    mat = vstack(mats.collect())
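
A possible middle ground between the two variants: the pairwise reduce builds a new matrix on every vstack call, while a plain collect ships one tiny matrix per row to the driver. Stacking inside each partition first sends one block per partition instead. The sketch below is an assumption rather than part of the answer. `row_matrix` and `stack_partition` are hypothetical names, and plain Python lists stand in for Spark partitions so the code runs without a cluster.

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack


def row_matrix(size, indices, values):
    """Represent one sparse vector as a 1 x size CSR matrix."""
    indptr = np.array([0, len(values)])
    return csr_matrix((values, indices, indptr), shape=(1, size))


def stack_partition(rows):
    """Stack all single-row matrices of one partition into one block."""
    rows = list(rows)
    if rows:  # empty partitions contribute nothing
        yield vstack(rows, format="csr")


# Two plain lists stand in for two Spark partitions.
partitions = [
    [row_matrix(3, [0, 2], [1.0, 3.0])],
    [row_matrix(3, [1], [4.0]), row_matrix(3, [0], [5.0])],
]
blocks = [block for part in partitions for block in stack_partition(iter(part))]

# Final stack on the driver: one vstack over a handful of blocks.
mat = vstack(blocks, format="csr")
```

On a real RDD this would presumably read `vstack(mats.mapPartitions(stack_partition).collect())`, since collect returns the blocks in partition order.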