python gensim: indices array has non-integer dtype (float64)

I am following the gensim tutorial to find similarities between texts. Here is the code:
from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
'''
documents = ["Human machine interface for lab abc computer applications",
             "bags loose tea water second ingredient tastes water",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey",
             "red cow butter oil"]
'''
documents = ["Human machine interface for lab abc computer applications",
             "bags loose tea water second ingredient tastes water"]
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once]
         for text in texts]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
#print corpus
tfidf = models.TfidfModel(corpus)
#print tfidf
corpus_tfidf = tfidf[corpus]
#print corpus_tfidf
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
lsi.print_topics(1)
lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)
lda.print_topics(1)
corpora.MmCorpus.serialize('dict.mm', corpus)
corpus = corpora.MmCorpus('dict.mm')
#print corpus
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)
doc = "human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
#print vec_lsi
index = similarities.MatrixSimilarity(lsi[corpus])
index.save('dict.index')
index = similarities.MatrixSimilarity.load('dict.index')
sims = index[vec_lsi]
#print list(enumerate(sims))
sims = sorted(enumerate(sims),key=lambda item: -item[1])
for sim in sims:
    print documents[sim[0]], " ==> ", sim[1]
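There are two document lists here. One has 10 texts and the other has 2; the first is commented out. If I use the first list, everything works and the output is meaningful. If I use the second list (the one with two texts), I get the following: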
/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:122: UserWarning: indices array has non-integer dtype (float64)
% self.indices.dtype.name)
What is the reason behind this error, and how can I fix it? I am using a 64-bit machine.
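Before looking at gensim itself, it may help to see what the two-text list is reduced to by the preprocessing steps in the question. The following is a minimal sketch in plain Python (no gensim required); it uses `collections.Counter` instead of `all_tokens.count`, which is a substitution made here for brevity, not part of the original code. Whether the resulting shape is what triggers the warning is an assumption and may depend on the installed gensim and scipy versions.

```python
from collections import Counter

documents = [
    "Human machine interface for lab abc computer applications",
    "bags loose tea water second ingredient tastes water",
]
stoplist = set("for a of the and to in".split())

# Same preprocessing as in the question: lowercase, split, drop stop words.
texts = [[w for w in doc.lower().split() if w not in stoplist]
         for doc in documents]

# Count every token across all documents, then drop tokens seen only once.
counts = Counter(w for text in texts for w in text)
texts = [[w for w in text if counts[w] > 1] for text in texts]

# The vocabulary that a dictionary would be built from.
vocabulary = sorted(set(w for text in texts for w in text))

print(texts)       # [[], ['water', 'water']]
print(vocabulary)  # ['water']
```

Only "water" occurs more than once, so the first text becomes empty and the vocabulary holds a single term, while the models are asked for `num_topics=2`.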
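Can you explain? – qmaruf
See the updated answer. –
Empty lists (= empty documents) are perfectly fine. The second example fails because the '(1,)' 1-tuple is not a valid sparse entry. An entry must always be a '(token_id, token_weight)' 2-tuple. – Radim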
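The rule in the last comment can be checked with a few lines of plain Python. This is only an illustrative sketch: `is_valid_bow` is a hypothetical helper written for this example, not a gensim function, and it only checks the shape of each entry.

```python
def is_valid_bow(vector):
    """Return True if every entry is a (token_id, token_weight) pair.

    Hypothetical helper for illustration; not part of gensim.
    """
    for entry in vector:
        # Each sparse entry must be a sequence of exactly two items.
        if not isinstance(entry, (tuple, list)) or len(entry) != 2:
            return False
        token_id, weight = entry
        # The id must be an integer, the weight a number.
        if not isinstance(token_id, int):
            return False
        if not isinstance(weight, (int, float)):
            return False
    return True


print(is_valid_bow([]))        # True: an empty document is fine
print(is_valid_bow([(0, 2)]))  # True: one (token_id, token_weight) pair
print(is_valid_bow([(1,)]))    # False: a 1-tuple is not a valid entry
```

Running a check like this over a corpus before handing it to a model is one way to catch malformed entries early.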