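Return the most similar documents to a query document using cosine similarity in Python

I have a set of documents and a query document. My goal is to return the most similar documents by comparing the query document with each document. To use cosine similarity, I first have to map the document strings to vectors. I have also created a tf-idf function that is computed for each document.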
To get the index of each word, I have a function like this:
def getvectorKeywordIndex(self, documentList):
    """Create the keyword associated to the position of the elements within the document vectors."""
    # Map all documents into a single string of words
    vocabularyString = " ".join(documentList)
    # split() without arguments also handles repeated whitespace and avoids empty tokens
    vocabularylist = vocabularyString.split()
    # Remove duplicates; sorting keeps the word-to-position mapping stable between runs
    vocabularylist = sorted(set(vocabularylist))
    print('vocabularylist', vocabularylist)
    vectorIndex = {}
    offset = 0
    # Associate a position with each keyword; the position is the dimension of the vector used to represent this word
    for word in vocabularylist:
        vectorIndex[word] = offset
        offset += 1
    print(vectorIndex)
    return vectorIndex, vocabularylist  # (keyword: position), vocabularylist
And my cosine similarity function is:
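# Requires "import math" and "import numpy" at module level.
def cosine_distance(self, index, queryDoc):
    # Despite the name, this returns the cosine similarity (higher means more similar).
    vector1 = self.makeVector(index)
    vector2 = self.makeVector(queryDoc)
    return numpy.dot(vector1, vector2) / (math.sqrt(numpy.dot(vector1, vector1)) * math.sqrt(numpy.dot(vector2, vector2)))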
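As a quick sanity check of the formula above, here is a minimal standalone sketch. The function name `cosine_similarity` and the sample vectors are made up for illustration. It also guards against an all-zero vector, which the method above would divide by.

```python
import math
import numpy


def cosine_similarity(vector1, vector2):
    # Product of the two Euclidean norms.
    norm = math.sqrt(numpy.dot(vector1, vector1)) * math.sqrt(numpy.dot(vector2, vector2))
    if norm == 0:
        # Assumption: treat an all-zero vector as having no similarity to anything.
        return 0.0
    return numpy.dot(vector1, vector2) / norm


a = numpy.array([1.0, 0.0, 1.0])
b = numpy.array([1.0, 0.0, 1.0])  # same direction as a
c = numpy.array([0.0, 1.0, 0.0])  # orthogonal to a
zero = numpy.zeros(3)
```

Vectors pointing the same way score close to 1, orthogonal vectors score 0, so the document with the highest score is the best match.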
The tf-idf function is:
def tfidf(self, term, key):
    return self.tf(term, key) * self.idf(term)
My question is how to write makeVector so that it uses the index, the vocabulary list, and the tf-idf function. Any answers are welcome.
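One possible shape for makeVector, as a rough sketch only: allocate a zero vector with one slot per vocabulary word, then write each term's tf-idf weight into the slot given by the index. Everything beyond the names used in the question is an assumption here: the class `VectorSpace`, the `documents` dictionary, the definitions of `tf` and `idf` (which the question does not show), the zero-norm guard, and the helper `most_similar`. The query document is assumed to be part of the document list so that all of its terms appear in the index.

```python
import math
import numpy


class VectorSpace(object):
    """Hypothetical container; only the method names from the question are kept."""

    def __init__(self, documentList):
        # Tokenize by whitespace; each key is the document's position in the list.
        self.documents = dict(enumerate(doc.split() for doc in documentList))
        vocabulary = sorted(set(t for terms in self.documents.values() for t in terms))
        self.vectorIndex = dict((word, i) for i, word in enumerate(vocabulary))

    def tf(self, term, key):
        # Assumed definition: relative frequency of the term in document `key`.
        terms = self.documents[key]
        return terms.count(term) / float(len(terms))

    def idf(self, term):
        # Assumed definition: log(number of documents / documents containing the term).
        df = sum(1 for terms in self.documents.values() if term in terms)
        return math.log(len(self.documents) / float(df))

    def tfidf(self, term, key):
        return self.tf(term, key) * self.idf(term)

    def makeVector(self, key):
        # One dimension per vocabulary word, filled with that word's tf-idf weight.
        vector = numpy.zeros(len(self.vectorIndex))
        for term in set(self.documents[key]):
            vector[self.vectorIndex[term]] = self.tfidf(term, key)
        return vector

    def cosine_similarity(self, key1, key2):
        v1, v2 = self.makeVector(key1), self.makeVector(key2)
        norm = math.sqrt(numpy.dot(v1, v1)) * math.sqrt(numpy.dot(v2, v2))
        # Guard against all-zero vectors (e.g. every term occurs in every document).
        return numpy.dot(v1, v2) / norm if norm else 0.0

    def most_similar(self, queryKey):
        # Compare the query with every other document and keep the best score.
        others = [k for k in self.documents if k != queryKey]
        return max(others, key=lambda k: self.cosine_similarity(k, queryKey))


# The last entry plays the role of the query document.
vs = VectorSpace(["cat sat on the mat", "dogs bark at night", "the cat sat"])
best = vs.most_similar(2)
```

Because a word that occurs in every document gets an idf of zero under this definition, it contributes nothing to the vectors; a smoothed idf would be a reasonable alternative if that is not wanted.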