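Python sparse matrix: remove duplicate indices, keeping only one of each pair?

I compute the cosine similarity between the vectors of a matrix, and the result comes back as a sparse matrix that looks like this: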

(0, 26)  0.359171459261
(0, 25)  0.121145761751
(0, 24)  0.316922015914
(0, 23)  0.157622038039
(0, 22)  0.636466644041
(0, 21)  0.136216495731
(0, 20)  0.243164535496
(0, 19)  0.348272617805
(0, 18)  0.636466644041
(0, 17)  1.0

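But there are also duplicates, for example:

(0, 24) 0.316922015914 and (24, 0) 0.316922015914

What I want is to remove them programmatically: if I have (0, 24), I don't need (24, 0), because it holds the same value. For every vector in the matrix I want to keep one entry of each such pair and drop the other. At the moment I create the matrix with the following code: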

vectorized_words = sparse.csr_matrix(vectorize_words(nostopwords,glove_dict)) 
cos_similiarity = cosine_similarity(vectorized_words,dense_output=False)

To summarize: I don't want to delete every duplicated value; I want to keep exactly one entry of each pair, using Python.

Thanks in advance!

Source

2017-03-31 nitheism

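Where do 'vectorize_words' and 'cosine_similarity' come from? It may be easier to drop the 'duplicates' while generating 'cos_similarity' than to remove them from the matrix afterwards. Sparse matrices are not designed for element-by-element manipulation. – hpaulj

'scipy.spatial.distance.squareform' converts to and from a compact upper-triangle form, which eliminates the duplicates. I don't know whether there is a version that works on sparse matrices. – hpaulj

@hpaulj cosine_similarity comes from sklearn, and vectorize_words is my own function that gets the vector for each word. – nitheism

I think the easiest approach is to take the upper triangle of the matrix in coo format.

First make a small symmetric matrix: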

In [876]: A = sparse.random(5,5,.3,'csr') 
In [877]: A = A+A.T 
In [878]: A 
Out[878]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>' 
    with 11 stored elements in Compressed Sparse Row format> 
In [879]: A.A 
Out[879]: 
array([[ 0.  , 0.  , 0.81388978, 0.  , 0.  ], 
     [ 0.  , 0.  , 0.73944395, 0.20736975, 0.98968617], 
     [ 0.81388978, 0.73944395, 0.  , 0.  , 0.  ], 
     [ 0.  , 0.20736975, 0.  , 0.05581152, 0.04448881], 
     [ 0.  , 0.98968617, 0.  , 0.04448881, 0.  ]])

Convert to coo, and set the lower-triangle data values (the entries where row > col) to 0:

In [880]: Ao = A.tocoo() 
In [881]: mask = (Ao.row>Ao.col) 
In [882]: mask 
Out[882]: 
array([False, False, False, False, True, True, True, False, False, 
     True, True], dtype=bool) 
In [883]: Ao.data[mask]=0

Convert back to csr, and use eliminate_zeros to prune the matrix. The stored-element count drops from 11 to 6:

In [890]: A1 = Ao.tocsr() 
In [891]: A1 
Out[891]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>' 
    with 11 stored elements in Compressed Sparse Row format> 
In [892]: A1.eliminate_zeros() 
In [893]: A1 
Out[893]: 
<5x5 sparse matrix of type '<class 'numpy.float64'>' 
    with 6 stored elements in Compressed Sparse Row format> 
In [894]: A1.A 
Out[894]: 
array([[ 0.  , 0.  , 0.81388978, 0.  , 0.  ], 
     [ 0.  , 0.  , 0.73944395, 0.20736975, 0.98968617], 
     [ 0.  , 0.  , 0.  , 0.  , 0.  ], 
     [ 0.  , 0.  , 0.  , 0.05581152, 0.04448881], 
     [ 0.  , 0.  , 0.  , 0.  , 0.  ]])
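The steps above can be collected into one small function. This is only a sketch: the helper name upper_triangle_via_zeroing and the hand-made 3x3 matrix are made up for illustration, and the result is cross-checked against scipy.sparse.triu, a SciPy function that the answer itself does not use but that extracts the same upper triangle.

```python
import numpy as np
from scipy import sparse


def upper_triangle_via_zeroing(sym):
    """Return a CSR copy of `sym` keeping only entries with row <= col.

    Hypothetical helper name; it follows the zero-then-prune steps above.
    """
    coo = sym.tocoo(copy=True)          # copy, so the caller's matrix is untouched
    coo.data[coo.row > coo.col] = 0     # zero the strictly lower triangle
    out = coo.tocsr()
    out.eliminate_zeros()               # in place: drops every stored zero
    return out


# Small hand-made symmetric matrix (illustrative values only).
dense = np.array([[0.0, 0.0, 0.8],
                  [0.0, 0.5, 0.2],
                  [0.8, 0.2, 0.0]])
A = sparse.csr_matrix(dense)

A1 = upper_triangle_via_zeroing(A)
A2 = sparse.triu(A, k=0, format='csr')  # SciPy's built-in upper-triangle extraction

print(A1.toarray())
print(A2.toarray())
```

Both results keep the diagonal. For a cosine-similarity matrix, where the diagonal is just each vector's similarity with itself, passing k=1 to sparse.triu (or testing row >= col in the helper) would drop it as well.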

Both the coo and csr formats have an in-place eliminate_zeros method. The coo version is:

def eliminate_zeros(self): 
    """Remove zero entries from the matrix 

    This is an *in place* operation 
    """ 
    mask = self.data != 0 
    self.data = self.data[mask] 
    self.row = self.row[mask] 
    self.col = self.col[mask]

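Instead of using Ao.data[mask]=0, you could use this code as a model for removing just the lower-triangle entries, leaving any other stored zeros alone.

Source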
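A minimal sketch of that idea, assuming the goal is to drop lower-triangle entries by position only, so that a zero that was already stored in the upper triangle is not removed. The function name keep_upper_triangle and its k parameter are hypothetical, chosen for this example; instead of mutating the coo attributes in place as eliminate_zeros does, it builds a new coo_matrix from the filtered arrays.

```python
import numpy as np
from scipy import sparse


def keep_upper_triangle(mat, k=0):
    """Keep stored entries with col - row >= k; the values are never inspected.

    k=0 keeps the diagonal, k=1 keeps only the strictly upper triangle.
    """
    coo = mat.tocoo()
    keep = (coo.col - coo.row) >= k     # positional mask, independent of the data
    filtered = sparse.coo_matrix(
        (coo.data[keep], (coo.row[keep], coo.col[keep])),
        shape=coo.shape,
    )
    return filtered.tocsr()


# Symmetric example with an explicitly stored zero at (0, 1) and (1, 0).
rows = np.array([0, 1, 0, 2, 1])
cols = np.array([1, 0, 2, 0, 1])
vals = np.array([0.0, 0.0, 0.3, 0.3, 1.0])
S = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 3))

U = keep_upper_triangle(S)              # diagonal kept
strict = keep_upper_triangle(S, k=1)    # diagonal dropped

print(U.nnz, strict.nnz)
```

Because the mask looks only at row and col, the stored zero at (0, 1) survives in both results, which is the case the zero-then-eliminate_zeros route cannot handle.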

2017-03-31 19:38:04 hpaulj

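Won't 'eliminate_zeros' remove all of the zeros? I mean, I might have a zero somewhere that comes from the original matrix, and it would be removed too? – nitheism

Yes. I'll add the code for the 'coo' 'eliminate_zeros', in case you want to work with the 'mask' directly. – hpaulj

Thank you very much – nitheism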
