2017-08-14 53 views
2

Python - data structure of csr_matrix

I am studying TF-IDF. I have used tfidf_vectorizer.fit_transform. It returns a csr_matrix, but I don't understand the structure of the result.

  • Input data:

documents = ("The sky is blue", "The sun is bright", "The sun in the sky is bright", "We can see the shining sun the bright sun")

  • Code:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
print(tfidf_matrix)
  • The result is:

(0, 9) 0.34399327143
(0, 7) 0.519713848879
(0, 4) 0.420753151645
(0, 0) 0.659191117868
(1, 9) 0.426858009784
(1, 4) 0.522108621994
(1, 8) 0.522108621994
(1, 1) 0.522108621994
(2, 9) 0.526261040111
(2, 7) 0.397544332095
(2, 4) 0.32184639876
(2, 8) 0.32184639876
(2, 1) 0.32184639876
(2, 3) 0.504234576856
(3, 9) 0.390963088213
(3, 8) 0.47820398015
(3, 1) 0.239101990075
(3, 10) 0.374599471224
(3, 2) 0.374599471224
(3, 5) 0.374599471224
(3, 6) 0.374599471224

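tfidf_matrix is a csr_matrix. So I looked it up here, but none of the structures shown there look like my result: scipy.sparse.csr_matrix

What structure does a value like (0, 9) 0.34399327143 have?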

+1

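This looks like a matrix that collects some kind of statistics about the sentences in the list (4 of them) and the unique words (11?). For example, the first row has 4 matrix entries, one for each of its 4 words. `tfidf_matrix.A` should display it in the conventional matrix form. – hpaulj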

+0

@hpaulj: Could you write out the matrix for me in more detail? –

Answers

2

Without the vectorizer, I can recreate the matrix, more or less, with this sequence of operations:

In [703]: documents = ("The sky is blue", "The sun is bright", "The sun in the sky is bright", "We can see the shining sun the bright sun") 

Get a list of lists of the words (all lowercase):

In [704]: alist = [l.lower().split() for l in documents] 

Get a sorted list of the (unique) words:

In [705]: aset = set() 
In [706]: [aset.update(l) for l in alist] 
Out[706]: [None, None, None, None] 
In [707]: unq = sorted(list(aset)) 
In [708]: unq 
Out[708]: 
['blue', 
'bright', 
'can', 
'in', 
'is', 
'see', 
'shining', 
'sky', 
'sun', 
'the', 
'we'] 

Go through alist and collect word counts. rows will be the sentence number, cols will be the unique-word index:

In [709]: rows, cols, data = [],[],[] 
In [710]: for i,row in enumerate(alist): 
    ...:  for c in row: 
    ...:   rows.append(i) 
    ...:   cols.append(unq.index(c)) 
    ...:   data.append(1) 
    ...:   

Build a sparse matrix from this data:

In [711]: M = sparse.csr_matrix((data,(rows,cols))) 
In [712]: M 
Out[712]: 
<4x11 sparse matrix of type '<class 'numpy.int32'>' 
    with 21 stored elements in Compressed Sparse Row format> 
In [713]: print(M) 
    (0, 0) 1 
    (0, 4) 1 
    (0, 7) 1 
    (0, 9) 1 
    (1, 1) 1 
    .... 
    (3, 9) 2 
    (3, 10) 1 
In [714]: M.A  # viewed as 2d array 
Out[714]: 
array([[1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0], 
     [0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0], 
     [0, 1, 0, 1, 1, 0, 0, 1, 1, 2, 0], 
     [0, 1, 1, 0, 0, 1, 1, 0, 2, 2, 1]], dtype=int32) 
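
The `(3, 9) 2` entry above comes from a detail of this constructor: when the same (row, col) pair is supplied more than once, the values are added together. A minimal sketch of that behaviour, using only the fourth sentence and assuming the column numbers from the sorted `unq` list above (the names `rows`, `cols`, `data` are reused for illustration; `.toarray()` is used instead of `.A` because newer SciPy releases may no longer provide that shorthand):

```python
import numpy as np
from scipy import sparse

# Token stream for the fourth sentence only:
# "we can see the shining sun the bright sun"
# Column numbers are positions in the sorted vocabulary
# (bright=1, can=2, see=5, shining=6, sun=8, the=9, we=10).
cols = np.array([10, 2, 5, 9, 6, 8, 9, 1, 8])
rows = np.zeros_like(cols)   # all tokens belong to a single row here
data = np.ones_like(cols)    # one count per token occurrence

# (data, (rows, cols)) is the coordinate-style input; duplicate
# (row, col) pairs are summed while the CSR matrix is built.
M = sparse.csr_matrix((data, (rows, cols)), shape=(1, 11))

print(M.nnz)        # stored entries after duplicates are merged
print(M.toarray())  # dense view of the row
```

Nine tokens go in, but only seven entries are stored, with "sun" and "the" each holding a count of 2.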

Using sklearn, I can reproduce your matrix:

In [717]: from sklearn import feature_extraction 
In [718]: tf = feature_extraction.text.TfidfVectorizer() 
In [719]: tfM = tf.fit_transform(documents) 
In [720]: tfM 
Out[720]: 
<4x11 sparse matrix of type '<class 'numpy.float64'>' 
    with 21 stored elements in Compressed Sparse Row format> 
In [721]: print(tfM) 
    (0, 9) 0.34399327143 
    (0, 7) 0.519713848879 
    (0, 4) 0.420753151645 
    .... 
    (3, 5) 0.374599471224 
    (3, 6) 0.374599471224 
In [722]: tfM.A 
Out[722]: 
array([[ 0.65919112, 0.  , 0.  , 0.  , 0.42075315, 
     0.  , 0.  , 0.51971385, 0.  , 0.34399327, 
     0.  ],.... 
     [ 0.  , 0.23910199, 0.37459947, 0.  , 0.  , 
     0.37459947, 0.37459947, 0.  , 0.47820398, 0.39096309, 
     0.37459947]]) 
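
For anyone wondering how the counts turn into these floats, here is a rough sketch that reproduces them with NumPy alone. It assumes what I understand to be the `TfidfVectorizer` defaults (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row); check the sklearn documentation before relying on it. The names `counts`, `df`, `idf` and `tfidf` are mine, not part of any library.

```python
import numpy as np

# Raw word counts, copied from the dense count matrix shown earlier.
counts = np.array([[1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
                   [0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0],
                   [0, 1, 0, 1, 1, 0, 0, 1, 1, 2, 0],
                   [0, 1, 1, 0, 0, 1, 1, 0, 2, 2, 1]], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # number of sentences containing each word
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf (assumed sklearn default)
weighted = counts * idf                     # term frequency times idf
# Scale each row to unit Euclidean length (assumed norm='l2' default).
tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.round(tfidf[0], 8))
```

Under those assumptions the first row comes out with 0.65919112 in column 0 and 0.34399327 in column 9, matching the output above.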

The actual data is stored in 3 attribute arrays:

In [723]: tfM.indices 
Out[723]: 
array([ 9, 7, 4, 0, 9, 4, 8, 1, 9, 7, 4, 8, 1, 3, 9, 8, 1, 
     10, 2, 5, 6], dtype=int32) 
In [724]: tfM.data 
Out[724]: 
array([ 0.34399327, 0.51971385, 0.42075315, 0.65919112, 0.42685801, 
     ... 
     0.37459947]) 
In [725]: tfM.indptr 
Out[725]: array([ 0, 4, 8, 14, 21], dtype=int32) 
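
To connect these three arrays to the printed `(row, col) value` lines: `indptr[i]` and `indptr[i + 1]` mark the slice of `indices` and `data` that belongs to row `i`. A small sketch that walks them by hand, using the `indices` and `indptr` shown above and the 21 values from the question's printout (the variable `triples` is just an illustrative name):

```python
from scipy import sparse

indptr = [0, 4, 8, 14, 21]
indices = [9, 7, 4, 0, 9, 4, 8, 1, 9, 7, 4, 8, 1, 3, 9, 8, 1, 10, 2, 5, 6]
data = [0.34399327143, 0.519713848879, 0.420753151645, 0.659191117868,
        0.426858009784, 0.522108621994, 0.522108621994, 0.522108621994,
        0.526261040111, 0.397544332095, 0.32184639876, 0.32184639876,
        0.32184639876, 0.504234576856,
        0.390963088213, 0.47820398015, 0.239101990075, 0.374599471224,
        0.374599471224, 0.374599471224, 0.374599471224]

# Row i owns the slice indptr[i]:indptr[i + 1] of both indices and data.
triples = []
for row in range(len(indptr) - 1):
    start, stop = indptr[row], indptr[row + 1]
    for col, value in zip(indices[start:stop], data[start:stop]):
        triples.append((row, col, value))

for row, col, value in triples[:4]:
    print((row, col), value)

# The same three arrays can be handed straight back to the constructor.
dense = sparse.csr_matrix((data, indices, indptr), shape=(4, 11)).toarray()
```

The first four triples printed are exactly the first four lines of the question's output.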

The indices values for each row tell us which words appear in that sentence:

In [726]: np.array(unq)[M[0,].indices] 
Out[726]: 
array(['blue', 'is', 'sky', 'the'], 
     dtype='<U7') 
In [727]: np.array(unq)[M[3,].indices] 
Out[727]: 
array(['bright', 'can', 'see', 'shining', 'sun', 'the', 'we'], 
     dtype='<U7') 
+0

Thank you, very detailed and helpful –

3

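What you see is just the string representation that is used when calling print(my_csr_mat). It lists (in your case) all the nonzeros in your matrix. (With a large number of nonzeros, the output may be truncated.)

As this is a sparse matrix, it has 2 dimensions.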

(0, 9) 0.34399327143 

means: the matrix element at position [0, 9] is 0.34399327143.
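
This can be checked by indexing. The sketch below rebuilds only the first row of the matrix in question from its printed pairs (the name `m` is illustrative) and reads elements back by position:

```python
from scipy.sparse import csr_matrix

# First row only, taken from the printed (0, col) value pairs.
cols = [9, 7, 4, 0]
vals = [0.34399327143, 0.519713848879, 0.420753151645, 0.659191117868]
m = csr_matrix((vals, ([0, 0, 0, 0], cols)), shape=(1, 11))

print(m[0, 9])         # a stored entry
print(m[0, 1])         # not stored, so it reads back as zero
print(m.shape, m.nnz)  # two dimensions, four stored values
```

Positions that do not appear in the printout are simply zeros that the matrix does not store.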

A small demo:

import numpy as np 
from scipy.sparse import csr_matrix 

matrix_dense = np.arange(20).reshape(4,5) 
zero_out = np.random.choice((0,1), size=(4,5), p=(0.7, 0.3)) 
matrix_dense_mod = matrix_dense * zero_out 

print(matrix_dense_mod) 

sparse_mat = csr_matrix(matrix_dense_mod) 

print(sparse_mat) 

Output:

[[ 0  0  2  0  4]
 [ 0  6  0  8  0]
 [ 0 11  0 13 14]
 [15  0  0 18 19]]
    (0, 2)  2 
    (0, 4)  4 
    (1, 1)  6 
    (1, 3)  8 
    (2, 1)  11 
    (2, 3)  13 
    (2, 4)  14 
    (3, 0)  15 
    (3, 3)  18 
    (3, 4)  19 

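I'm not sure what you mean by "So I find on this, but there are no structure as same as the result", but note: most examples in the scipy.sparse documentation call my_mat.toarray(), which builds a dense array from the sparse matrix, and a dense array has a different string representation.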
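
If you want the `(row, col) value` triples as data rather than as printed text, one option is to convert to COO format, which exposes them as three parallel arrays. A short sketch using the demo's output matrix with fixed values instead of the random mask (`triples` is an illustrative name):

```python
import numpy as np
from scipy.sparse import csr_matrix

# The masked matrix from the demo output, written out explicitly.
dense = np.array([[0, 0, 2, 0, 4],
                  [0, 6, 0, 8, 0],
                  [0, 11, 0, 13, 14],
                  [15, 0, 0, 18, 19]])
sparse_mat = csr_matrix(dense)

# COO keeps row index, column index and value side by side.
coo = sparse_mat.tocoo()
triples = list(zip(coo.row.tolist(), coo.col.tolist(), coo.data.tolist()))
print(triples[:3])

# toarray() goes the other way and gives back the dense layout.
print(np.array_equal(sparse_mat.toarray(), dense))
```

Both views describe the same matrix; only the representation differs.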

+0

Thanks. I got it –