2013-07-20 58 views
2

python gensim: indices array has non-integer dtype (float64)

I am using this gensim tutorial to find similarities between texts. Here is the code:

from gensim import corpora, models, similarities 
from gensim.models import hdpmodel, ldamodel 
from itertools import izip 

import logging 
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO) 

''' 
documents = ["Human machine interface for lab abc computer applications", 
       "bags loose tea water second ingredient tastes water", 
       "The EPS user interface management system", 
       "System and human system engineering testing of EPS", 
       "Relation of user perceived response time to error measurement", 
       "The generation of random binary unordered trees", 
       "The intersection graph of paths in trees", 
       "Graph minors IV Widths of trees and well quasi ordering", 
       "Graph minors A survey", 
       "red cow butter oil"] 
''' 
documents = ["Human machine interface for lab abc computer applications", 
       "bags loose tea water second ingredient tastes water"] 

# remove common words and tokenize 
stoplist = set('for a of the and to in'.split()) 
texts = [[word for word in document.lower().split() if word not in stoplist] 
     for document in documents] 

# remove words that appear only once 
all_tokens = sum(texts, []) 
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1) 
texts = [[word for word in text if word not in tokens_once] 
     for text in texts] 

dictionary = corpora.Dictionary(texts) 
corpus = [dictionary.doc2bow(text) for text in texts] 

#print corpus 

tfidf = models.TfidfModel(corpus) 

#print tfidf 

corpus_tfidf = tfidf[corpus] 

#print corpus_tfidf 

lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) 
lsi.print_topics(1) 

lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2) 
lda.print_topics(1) 

corpora.MmCorpus.serialize('dict.mm', corpus) 
corpus = corpora.MmCorpus('dict.mm') 
#print corpus 

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2) 
doc = "human computer interaction" 
vec_bow = dictionary.doc2bow(doc.lower().split()) 
vec_lsi = lsi[vec_bow] 
#print vec_lsi 

index = similarities.MatrixSimilarity(lsi[corpus]) 
index.save('dict.index') 
index = similarities.MatrixSimilarity.load('dict.index') 

sims = index[vec_lsi] 
#print list(enumerate(sims)) 

sims = sorted(enumerate(sims),key=lambda item: -item[1]) 
for sim in sims: 
    print documents[sim[0]], " ==> ", sim[1] 

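There are two document lists here. One has 10 texts, the other has 2, and one of them is commented out. If I use the first list, everything works and the output is meaningful. If I use the second list (the one with two texts), an error occurs. Here it is: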

/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:122: UserWarning: indices array has non-integer dtype (float64) 
% self.indices.dtype.name) 

What is the reason behind this error, and how can I fix it? I am using a 64-bit machine.

Answers

2

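This is probably because your second list ends up as [[], ['water', 'water']] once you remove the singletons. Trying to perform matrix operations on matrices with dimensions 0 and 1 can cause all sorts of problems.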
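To see where the nearly empty list comes from, the preprocessing steps from the question can be re-run on the two-document list in plain Python, without gensim. This is only a sketch of those steps. Note that the second document keeps both occurrences of 'water', which is consistent with the `(0, 2)` entry in the session that follows.

```python
# Re-run the question's preprocessing on the two-document list.
# Plain Python only; gensim is not needed for this step.
documents = ["Human machine interface for lab abc computer applications",
             "bags loose tea water second ingredient tastes water"]

# Remove common words and tokenize.
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# Every token except 'water' occurs exactly once across both documents,
# so the singleton filter removes almost everything.
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once]
         for text in texts]

print(texts)  # first document is empty, second keeps only 'water'
```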

Playing around with your code:

>>> corpus = [dictionary.doc2bow(text) for text in texts] 
>>> corpus 
[[], [(0, 2)]] 
>>> tfidf = models.TfidfModel(corpus) 
2013-07-21 09:23:31,415 : INFO : collecting document frequencies 
2013-07-21 09:23:31,415 : INFO : PROGRESS: processing document #0 
2013-07-21 09:23:31,415 : INFO : calculating IDF weights for 2 documents and 1 features (1 matrix non-zeros) 
>>> corpus = [[(1,)], [(0,2)]] 
>>> tfidf = models.TfidfModel(corpus) 
2013-07-21 09:24:16,452 : INFO : collecting document frequencies 
2013-07-21 09:24:16,452 : INFO : PROGRESS: processing document #0 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "/usr/local/lib/python2.7/dist-packages/gensim/models/tfidfmodel.py", line 96, in __init__ 
    self.initialize(corpus) 
    File "/usr/local/lib/python2.7/dist-packages/gensim/models/tfidfmodel.py", line 119, in initialize 
    for termid, _ in bow: 
ValueError: need more than 1 value to unpack 
>>> corpus = [[(1,3)], [(0,2)]] 
>>> tfidf = models.TfidfModel(corpus) 
2013-07-21 09:24:26,892 : INFO : collecting document frequencies 
2013-07-21 09:24:26,892 : INFO : PROGRESS: processing document #0 
2013-07-21 09:24:26,892 : INFO : calculating IDF weights for 2 documents and 2 features (2 matrix non-zeros) 
>>> 

As I said above, you need to make sure that corpus does not contain any empty lists before calling models.TfidfModel(corpus).

+0

Can you explain? – qmaruf

+0

See the updated answer. –

+0

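Empty lists (= empty documents) are perfectly fine. The second example fails because the 1-tuple '(1,)' is not a valid sparse entry. An entry must always be a '(token_id, token_weight)' 2-tuple. – Radim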
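The rule in this comment can be checked before a corpus is handed to a model. The helper below is a hypothetical sketch (`invalid_entries` is not part of gensim) that assumes a corpus is a list of documents, each a list of sparse entries, and reports every entry that is not a 2-element pair.

```python
def invalid_entries(corpus):
    """Return (doc_index, entry) pairs whose entry is not a 2-element pair.

    Hypothetical helper for illustration; not a gensim function.
    """
    bad = []
    for doc_index, bow in enumerate(corpus):
        for entry in bow:
            try:
                is_pair = len(entry) == 2
            except TypeError:
                # Entry has no length at all, e.g. a bare integer.
                is_pair = False
            if not is_pair:
                bad.append((doc_index, entry))
    return bad


# An empty document is accepted; only the 1-tuple is flagged.
print(invalid_entries([[], [(0, 2)]]))
print(invalid_entries([[(1,)], [(0, 2)]]))
```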

0

This is not an error, it is a warning. You can ignore it.

Your query document doc is empty in the second case, which is what causes the warning. In any case, you still get the correct answer (= an empty vec_lsi).
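The query comes out empty because none of its tokens survive in the dictionary. The sketch below illustrates this with a hypothetical stand-in, `to_bow`, which mimics how `Dictionary.doc2bow` counts known tokens and drops unknown ones by default. It is not gensim's implementation, and the `token2id` mapping is assumed from the filtered texts. A simple guard can then skip the similarity lookup for empty queries.

```python
def to_bow(tokens, token2id):
    """Hypothetical stand-in for Dictionary.doc2bow.

    Counts tokens that are in the vocabulary and silently drops the rest,
    returning (token_id, count) pairs sorted by token_id.
    """
    counts = {}
    for token in tokens:
        if token in token2id:
            token_id = token2id[token]
            counts[token_id] = counts.get(token_id, 0) + 1
    return sorted(counts.items())


# Assumed vocabulary: only 'water' is left after the singleton filter.
token2id = {'water': 0}

doc = "human computer interaction"
vec_bow = to_bow(doc.lower().split(), token2id)

if not vec_bow:
    # Nothing in the query is known to the dictionary, so there is
    # nothing meaningful to compare against the index.
    print("query shares no tokens with the dictionary")
```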