Python tf-idf: a fast way to update the tf-idf matrix

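I have a dataset with several thousand rows of text. My goal is to compute tf-idf scores and then the cosine similarity between documents. This is what I did with gensim in Python, following a tutorial: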

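from gensim import corpora, models, similarities

# dat is a sequence of tokenized documents (each one a list of strings)
dictionary = corpora.Dictionary(dat)
corpus = [dictionary.doc2bow(text) for text in dat]

tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
index = similarities.MatrixSimilarity(corpus_tfidf)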

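Say we have built the tf-idf matrix and the similarity index. When a new document comes in, I want to query for the most similar documents in our existing dataset.

Question: is there any way to update the tf-idf matrix so that I don't have to append the new text document to the original dataset and recompute the whole thing?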

2017-02-13 snowneji

I'll post my solution since there are no other answers. Say we are in the following scenario:

import gensim 
from gensim import models 
from gensim import corpora 
from gensim import similarities 
from nltk.tokenize import word_tokenize 
import pandas as pd 

# routines: 
text = "I work on natural language processing and I want to figure out how does gensim work" 
text2 = "I love computer science and I code in Python" 
dat = pd.Series([text,text2]) 
dat = dat.apply(lambda x: str(x).lower()) 
dat = dat.apply(lambda x: word_tokenize(x)) 


dictionary = corpora.Dictionary(dat) 
corpus = [dictionary.doc2bow(doc) for doc in dat] 
tfidf = models.TfidfModel(corpus) 
corpus_tfidf = tfidf[corpus] 


#Query: 
query_text = "I love icecream and gensim" 
query_text = query_text.lower() 
query_text = word_tokenize(query_text) 
vec_bow = dictionary.doc2bow(query_text) 
vec_tfidf = tfidf[vec_bow]

If we take a look:

print(vec_bow) 
[(0, 1), (7, 1), (12, 1), (15, 1)]

and:

print(tfidf[vec_bow]) 
[(12, 0.7071067811865475), (15, 0.7071067811865475)]
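
As a rough check of where 0.7071 comes from, here is a minimal sketch. It assumes gensim's default weighting (raw term frequency times log2(N/df), followed by L2 normalization); the IDF values are worked out by hand for this two-document corpus, where `and` and `i` occur in both documents and `gensim` and `love` occur in one each. The variable names are only illustrative.

```python
import math

# Hand-computed IDF for the four query terms found in the dictionary,
# assuming idf = log2(N / df) with N = 2 documents.
idf = {
    "and": math.log2(2 / 2),     # in both documents -> 0
    "i": math.log2(2 / 2),       # in both documents -> 0
    "gensim": math.log2(2 / 1),  # in one document   -> 1
    "love": math.log2(2 / 1),    # in one document   -> 1
}
query_tf = {"and": 1, "i": 1, "gensim": 1, "love": 1}

# tf * idf, dropping zero weights, then L2-normalizing what is left.
raw = {term: query_tf[term] * idf[term] for term in query_tf}
raw = {term: w for term, w in raw.items() if w > 0}
norm = math.sqrt(sum(w * w for w in raw.values()))
weights = {term: w / norm for term, w in raw.items()}
print(weights)
```

Only two terms survive with equal weight, so each ends up at 1/sqrt(2). "icecream" never appears because it is not in the dictionary.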

FYI, the ID-to-token mapping:

print(dictionary.items()) 

[(0, u'and'), 
(1, u'on'), 
(8, u'processing'), 
(3, u'natural'), 
(4, u'figure'), 
(5, u'language'), 
(9, u'how'), 
(7, u'i'), 
(14, u'code'), 
(19, u'in'), 
(2, u'work'), 
(16, u'python'), 
(6, u'to'), 
(10, u'does'), 
(11, u'want'), 
(17, u'science'), 
(15, u'love'), 
(18, u'computer'), 
(12, u'gensim'), 
(13, u'out')]

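It looks like the query just picks up the terms that already exist in the dictionary and uses the precomputed weights to give you the tfidf score. So my workaround is to rebuild the model weekly or daily, since doing so is fast.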

2017-05-09 21:40:30 snowneji

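Does this actually work? I had assumed that, because of the nature of tf-idf, you fundamentally cannot update the model incrementally (update the tf-idf matrix): every time a new document comes in, you have to update the IDF values, across the whole corpus, of all the word features contained in that document. Also, what happens when a document comes in with a new word? Won't there be a feature-length mismatch? Please let me know, since I'm actively working on this problem too. – killerT2333

It works, but I believe what it does is simply query your new document against your existing model. I'll edit my answer to show how it works. – snowneji

Wow! That's really cool, thanks a lot for sharing this. So, if I understand correctly, when a new query document comes in, does gensim compute the tfidf score from the precomputed tfidf matrix _and_ the new query document? Or does it compute it from the precomputed tfidf matrix only? Updating the model periodically makes more sense if new queries keep coming in, especially if updating the model is expensive. – killerT2333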
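
The concern raised in these comments can be illustrated without gensim. The sketch below assumes the plain idf = log2(N / df) formula; `idf_table` is a hypothetical helper written for this example, not a library function. It shows that adding one document changes the IDF of terms it contains, of terms it does not contain (because N grows), and also enlarges the vocabulary.

```python
import math

def idf_table(docs):
    """Return {term: log2(N / df)} for a list of tokenized documents."""
    n_docs = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {term: math.log2(n_docs / count) for term, count in df.items()}

docs = [
    "i work on natural language processing".split(),
    "i love computer science".split(),
]
before = idf_table(docs)

# A new document arrives, containing one word the corpus has never seen.
docs.append("i love icecream".split())
after = idf_table(docs)

print(before["love"], after["love"])               # term in the new document
print(before["work"], after["work"])               # term NOT in the new document
print("icecream" in before, "icecream" in after)   # vocabulary grew by one
```

Since every IDF value depends on N, an exact incremental update touches the whole matrix. That is consistent with the approach in the answer: query against the frozen model and rebuild it periodically.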
