Python - data structure of csr_matrix

I am studying TF-IDF. I have used tfidf_vectorizer.fit_transform. It returns a csr_matrix, but I don't understand the structure of the result.

Input data:

documents = ("The sky is blue", "The sun is bright", "The sun in the sky is bright", "We can see the shining sun the bright sun")

Code:

tfidf_vectorizer = TfidfVectorizer() 
tfidf_matrix = tfidf_vectorizer.fit_transform(documents) 
print(tfidf_matrix)

The result is:

(0, 9) 0.34399327143
(0, 7) 0.519713848879
(0, 4) 0.420753151645
(0, 0) 0.659191117868
(1, 9) 0.426858009784
(1, 4) 0.522108621994
(1, 8) 0.522108621994
(1, 1) 0.522108621994
(2, 9) 0.526261040111
(2, 7) 0.397544332095
(2, 4) 0.32184639876
(2, 8) 0.32184639876
(2, 1) 0.32184639876
(2, 3) 0.504234576856
(3, 9) 0.390963088213
(3, 8) 0.47820398015
(3, 1) 0.239101990075
(3, 10) 0.374599471224
(3, 2) 0.374599471224
(3, 5) 0.374599471224
(3, 6) 0.374599471224

tfidf_matrix is a csr_matrix. So I looked it up here, but found no structure that matches the result: scipy.sparse.csr_matrix

What is the structure of a value like (0, 9) 0.34399327143?

Source

2017-08-14 Brasc elok

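This looks like a matrix that collects some kind of statistics about the sentences in the list (4 of them) and the unique words (11?). For example, the first row has 4 matrix entries, one for each of its 4 words. 'tfidf_matrix.A' should display it in traditional matrix form. – hpaulj

@hpaulj: Could you help me by writing out the matrix in more detail? –

Without the vectorizer, I can recreate the matrix, more or less, with this sequence of operations: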

In [703]: documents = ("The sky is blue", "The sun is bright", "The sun in the sky is bright", "We can see the shining sun the bright sun")

Get a list of lists of the words (all lowercase):

In [704]: alist = [l.lower().split() for l in documents]

Get a sorted list of the unique words:

In [705]: aset = set() 
In [706]: [aset.update(l) for l in alist] 
Out[706]: [None, None, None, None] 
In [707]: unq = sorted(list(aset)) 
In [708]: unq 
Out[708]: 
['blue', 
'bright', 
'can', 
'in', 
'is', 
'see', 
'shining', 
'sky', 
'sun', 
'the', 
'we']

Go through alist and collect word counts. rows will be the sentence number, cols will be the unique word index:

In [709]: rows, cols, data = [],[],[] 
In [710]: for i,row in enumerate(alist): 
    ...:     for c in row: 
    ...:         rows.append(i) 
    ...:         cols.append(unq.index(c)) 
    ...:         data.append(1) 
    ...:

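Build a sparse matrix from this data (entries with the same row and column are summed, which is why repeated words get a count of 2):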

In [711]: M = sparse.csr_matrix((data,(rows,cols))) 
In [712]: M 
Out[712]: 
<4x11 sparse matrix of type '<class 'numpy.int32'>' 
    with 21 stored elements in Compressed Sparse Row format> 
In [713]: print(M) 
    (0, 0) 1 
    (0, 4) 1 
    (0, 7) 1 
    (0, 9) 1 
    (1, 1) 1 
    .... 
    (3, 9) 2 
    (3, 10) 1 
In [714]: M.A  # viewed as 2d array 
Out[714]: 
array([[1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0], 
     [0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0], 
     [0, 1, 0, 1, 1, 0, 0, 1, 1, 2, 0], 
     [0, 1, 1, 0, 0, 1, 1, 0, 2, 2, 1]], dtype=int32)

Since this is using sklearn, I can reproduce your matrix:

In [717]: from sklearn import feature_extraction 
In [718]: tf = feature_extraction.text.TfidfVectorizer() 
In [719]: tfM = tf.fit_transform(documents) 
In [720]: tfM 
Out[720]: 
<4x11 sparse matrix of type '<class 'numpy.float64'>' 
    with 21 stored elements in Compressed Sparse Row format> 
In [721]: print(tfM) 
    (0, 9) 0.34399327143 
    (0, 7) 0.519713848879 
    (0, 4) 0.420753151645 
    .... 
    (3, 5) 0.374599471224 
    (3, 6) 0.374599471224 
In [722]: tfM.A 
Out[722]: 
array([[ 0.65919112, 0.  , 0.  , 0.  , 0.42075315, 
     0.  , 0.  , 0.51971385, 0.  , 0.34399327, 
     0.  ],.... 
     [ 0.  , 0.23910199, 0.37459947, 0.  , 0.  , 
     0.37459947, 0.37459947, 0.  , 0.47820398, 0.39096309, 
     0.37459947]])
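To connect the count matrix M with the TF-IDF values, here is a rough sketch that rebuilds the weights by hand. It assumes the TfidfVectorizer defaults described in the scikit-learn docs (smooth_idf=True, norm='l2', raw counts as term frequency), and it assumes that a plain lower().split() tokenizes these sentences the same way the vectorizer does, which happens to hold here because every word has at least two letters. It is not how sklearn is implemented internally, only the same arithmetic on a dense array.

```python
import numpy as np
from scipy import sparse

documents = ("The sky is blue", "The sun is bright",
             "The sun in the sky is bright",
             "We can see the shining sun the bright sun")

# Count matrix, built like M above (duplicate entries are summed).
alist = [d.lower().split() for d in documents]
unq = sorted(set(w for words in alist for w in words))
rows = [i for i, words in enumerate(alist) for _ in words]
cols = [unq.index(w) for words in alist for w in words]
M = sparse.csr_matrix((np.ones(len(rows)), (rows, cols)),
                      shape=(len(documents), len(unq)))
counts = M.toarray()

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                # documents containing each word
idf = np.log((1 + n_docs) / (1 + df)) + 1    # smoothed idf (assumed default)
tfidf = counts * idf                         # term count times idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize rows

print(np.round(tfidf[0], 8))
```

Under those assumptions, the entry for 'the' in the first sentence (column 9) should come out close to the 0.34399327 shown in the (0, 9) line of the printed matrix.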

The actual data is stored in 3 attribute arrays:

In [723]: tfM.indices 
Out[723]: 
array([ 9, 7, 4, 0, 9, 4, 8, 1, 9, 7, 4, 8, 1, 3, 9, 8, 1, 
     10, 2, 5, 6], dtype=int32) 
In [724]: tfM.data 
Out[724]: 
array([ 0.34399327, 0.51971385, 0.42075315, 0.65919112, 0.42685801, 
     ... 
     0.37459947]) 
In [725]: tfM.indptr 
Out[725]: array([ 0, 4, 8, 14, 21], dtype=int32)
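To see how these three arrays fit together, here is a small sketch that walks them by hand and prints the same (row, col) value listing that print() gives. It uses the word-count array from above as a literal so that it runs on its own; the variable names are just for illustration. For row i, the column numbers are indices[indptr[i]:indptr[i+1]] and the matching values are data[indptr[i]:indptr[i+1]].

```python
import numpy as np
from scipy.sparse import csr_matrix

# The 4x11 word-count array shown earlier, typed in as a literal.
counts = np.array([[1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
                   [0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0],
                   [0, 1, 0, 1, 1, 0, 0, 1, 1, 2, 0],
                   [0, 1, 1, 0, 0, 1, 1, 0, 2, 2, 1]])
m = csr_matrix(counts)

triples = []
for row in range(m.shape[0]):
    # indptr[row]:indptr[row + 1] is the slice of indices/data for this row
    start, end = m.indptr[row], m.indptr[row + 1]
    for col, value in zip(m.indices[start:end], m.data[start:end]):
        triples.append((row, int(col), int(value)))
        print((row, int(col)), value)
```

Note that CSR does not require the column numbers within a row to be sorted. That is why tfM lists row 0 as 9, 7, 4, 0 while M lists it as 0, 4, 7, 9; both describe the same set of entries.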

The indices values for each row tell us which words occur in that sentence:

In [726]: np.array(unq)[M[0,].indices] 
Out[726]: 
array(['blue', 'is', 'sky', 'the'], 
     dtype='<U7') 
In [727]: np.array(unq)[M[3,].indices] 
Out[727]: 
array(['bright', 'can', 'see', 'shining', 'sun', 'the', 'we'], 
     dtype='<U7')

Source

2017-08-14 20:14:53 hpaulj

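Thank you, very detailed and helpful –

What you see is just the string representation that is used when you call print(my_csr_mat). It lists (in your case) all the nonzeros in your matrix. (The output may be truncated when there is a large number of nonzeros.)

Since this is a sparse matrix, it has 2 dimensions.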

(0, 9) 0.34399327143

means: the matrix element at position [0, 9] is 0.34399327143.

A small demo:

import numpy as np 
from scipy.sparse import csr_matrix 

matrix_dense = np.arange(20).reshape(4,5) 
zero_out = np.random.choice((0,1), size=(4,5), p=(0.7, 0.3)) 
matrix_dense_mod = matrix_dense * zero_out 

print(matrix_dense_mod) 

sparse_mat = csr_matrix(matrix_dense_mod) 

print(sparse_mat)

Output:

[[ 0 0 2 0 4] 
[ 0 6 0 8 0] 
[ 0 11 0 13 14] 
[15 0 0 18 19]] 
    (0, 2)  2 
    (0, 4)  4 
    (1, 1)  6 
    (1, 3)  8 
    (2, 1)  11 
    (2, 3)  13 
    (2, 4)  14 
    (3, 0)  15 
    (3, 3)  18 
    (3, 4)  19

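I don't know what you mean by "So I find on this, but there are no structure as same as the result", but note: most examples in the scipy.sparse docs call my_mat.toarray(), which means they build a dense array from the sparse matrix, and a dense array has a different string representation.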
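For illustration, here is a minimal sketch of reading the same matrix in both styles. It hard-codes the array from the output above (the demo itself is random, so your values will differ) and the names are only for this example.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Fixed copy of the array printed in the demo output above.
dense = np.array([[0, 0, 2, 0, 4],
                  [0, 6, 0, 8, 0],
                  [0, 11, 0, 13, 14],
                  [15, 0, 0, 18, 19]])
sparse_mat = csr_matrix(dense)

print(sparse_mat[0, 2])       # a stored entry, listed as (0, 2) by print()
print(sparse_mat[0, 0])       # not stored, so it reads back as zero
print(sparse_mat.toarray())   # dense 2-D NumPy array, as in the docs examples

# Row and column numbers of the nonzeros, i.e. the pairs print() shows.
rows, cols = sparse_mat.nonzero()
pairs = list(zip(rows.tolist(), cols.tolist()))
print(pairs)
```

So the (row, col) value lines and the toarray() output describe the same matrix; only the way it is displayed differs.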

Source

2017-08-14 16:28:34 sascha

Thanks. I got it –
