2011-04-12 208 views

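How do I compute cosine similarity with JDBC for my search engine project? I have a term-frequency query table that stores the input from the user and a term-frequency document table that stores all the information about the documents, and I have already computed the weights for both the query and the documents. After computing cosine similarity, the output should be the list of documents relevant to the query the user entered. I have no idea how to compute it, because it involves tables in the database.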


I don't understand your question. Are you asking how to query a table? Please be more specific, and perhaps include an example or two. – jzd 2011-04-12 16:37:50


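OK, the user has to enter a query, and the output the user gets comes from the data I have already stored in the table document. I have two tables, tf_query and tf_doc: tf_query stores the data from the user and tf_doc stores the data about the documents. I have already computed tf-idf and the weighting, and now I have to compute cosine similarity. – user692495 2011-04-12 16:40:01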


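For example: the user enters "how are you?" and the output is "how are you dear", which is stored in the table in document 1 and document 2. – user692495 2011-04-12 16:43:42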

Answers


This is a program that computes the cosine similarity between two sentences. I hope you can make the changes you need to get what you want.
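
For reference, this is the quantity the program computes, written with the same A and B that the code comments use for the two term-frequency vectors. The index i is just my notation for the position of a term in the shared vocabulary.

```latex
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^{2}} \; \sqrt{\sum_{i} B_i^{2}}}
```

When every component is non-negative, as with term counts or tf-idf weights, the value falls between 0 and 1.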

import java.util.HashMap; 
import java.util.HashSet; 
import java.util.Map; 
import java.util.Set; 

/** 
* 
* @author Xiao Ma 
* mail : [email protected] 
*/ 
public class SimilarityUtil { 

public static double consineTextSimilarity(String[] left, String[] right) { 
    Map<String, Integer> leftWordCountMap = new HashMap<String, Integer>(); 
    Map<String, Integer> rightWordCountMap = new HashMap<String, Integer>(); 
    Set<String> uniqueSet = new HashSet<String>(); 
    Integer temp = null; 
    for (String leftWord : left) { 
     temp = leftWordCountMap.get(leftWord); 
     if (temp == null) { 
      leftWordCountMap.put(leftWord, 1); 
      uniqueSet.add(leftWord); 
     } else { 
      leftWordCountMap.put(leftWord, temp + 1); 
     } 
    } 
    for (String rightWord : right) { 
     temp = rightWordCountMap.get(rightWord); 
     if (temp == null) { 
      rightWordCountMap.put(rightWord, 1); 
      uniqueSet.add(rightWord); 
     } else { 
      rightWordCountMap.put(rightWord, temp + 1); 
     } 
    } 
    int[] leftVector = new int[uniqueSet.size()]; 
    int[] rightVector = new int[uniqueSet.size()]; 
    int index = 0; 
    Integer tempCount = 0; 
    for (String uniqueWord : uniqueSet) { 
     tempCount = leftWordCountMap.get(uniqueWord); 
     leftVector[index] = tempCount == null ? 0 : tempCount; 
     tempCount = rightWordCountMap.get(uniqueWord); 
     rightVector[index] = tempCount == null ? 0 : tempCount; 
     index++; 
    } 
    return consineVectorSimilarity(leftVector, rightVector); 
} 

/** 
* The resulting similarity ranges from −1 meaning exactly opposite, to 1 
* meaning exactly the same, with 0 usually indicating independence, and 
* in-between values indicating intermediate similarity or dissimilarity. 
* 
* For text matching, the attribute vectors A and B are usually the term 
* frequency vectors of the documents. The cosine similarity can be seen as 
* a method of normalizing document length during comparison. 
* 
* In the case of information retrieval, the cosine similarity of two 
* documents will range from 0 to 1, since the term frequencies (tf-idf 
* weights) cannot be negative. The angle between two term frequency vectors 
* cannot be greater than 90°. 
* 
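* @param leftVector term counts of the first text 
* @param rightVector term counts of the second text, in the same term order 
* @return the cosine of the angle between the two vectors 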
*/ 
private static double consineVectorSimilarity(int[] leftVector, 
     int[] rightVector) { 
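    // Vectors of different lengths cannot be compared; fail loudly instead of reporting a perfect match. 
    if (leftVector.length != rightVector.length) { 
     throw new IllegalArgumentException("Vectors must have the same length"); 
    } 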
    double dotProduct = 0; 
    double leftNorm = 0; 
    double rightNorm = 0; 
    for (int i = 0; i < leftVector.length; i++) { 
     dotProduct += leftVector[i] * rightVector[i]; 
     leftNorm += leftVector[i] * leftVector[i]; 
     rightNorm += rightVector[i] * rightVector[i]; 
    } 

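    // An empty or all-zero vector has no direction; return 0 rather than NaN from a 0/0 division. 
    if (leftNorm == 0 || rightNorm == 0) { 
     return 0; 
    } 
    double result = dotProduct 
      /(Math.sqrt(leftNorm) * Math.sqrt(rightNorm)); 
    return result; 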
} 

public static void main(String[] args) { 
    String left[] = { "Julie", "loves", "me", "more", "than", "Linda", 
      "loves", "me" }; 
    String right[] = { "Jane", "likes", "me", "more", "than", "Julie", 
      "loves", "me" }; 
    System.out.println(consineTextSimilarity(left,right)); 
} 
} 
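
Since the question already has tf-idf weights stored in tf_query and tf_doc, here is a rough sketch of the same calculation on weighted terms instead of raw word counts. It assumes you have already read the rows into maps from term to weight (for example with a JDBC `ResultSet` loop); the class name `SparseCosine` and the map layout are my own assumptions, not something from the question.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: cosine similarity on sparse term -> weight maps.
class SparseCosine {

    // Cosine similarity between a query vector and a document vector.
    static double cosine(Map<String, Double> query, Map<String, Double> doc) {
        double dot = 0.0;
        // Only terms that occur in both maps contribute to the dot product.
        for (Map.Entry<String, Double> e : query.entrySet()) {
            Double w = doc.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double queryNorm = norm(query);
        double docNorm = norm(doc);
        // An empty or all-zero vector has no direction; report 0 instead of NaN.
        if (queryNorm == 0.0 || docNorm == 0.0) {
            return 0.0;
        }
        return dot / (queryNorm * docNorm);
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Made-up weights, only to show the call.
        Map<String, Double> query = new HashMap<>();
        query.put("how", 0.5);
        query.put("you", 0.3);

        Map<String, Double> doc = new HashMap<>();
        doc.put("how", 0.4);
        doc.put("dear", 0.7);

        System.out.println(cosine(query, doc));
    }
}
```

To rank results you would call `cosine` once per document and sort the documents by the returned score, highest first.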

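If you can modify the program, please help me, because I need it too, but to compute the similarity between two or more words. I tried, but it keeps giving me 0; this one computes the similarity between two sentences. – 2014-12-10 21:14:20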


Very helpful. Thanks – vladz 2015-11-10 21:32:38