2009-05-30 18 views
4

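Bag of words model: 2 PHP functions, same results: why?

I have two PHP functions that calculate the relation between two texts. Both use the bag of words model, but check2() is much faster. Still, both functions return the same results. Why? check1() uses one big dictionary array containing all words from both texts, as described in the bag of words model. check2() does not use one big array; it uses an array containing only the words of one text. So check2() should not work, but it does. Why do both functions return the same results?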

function check1($terms_in_article1, $terms_in_article2) { 
    global $zeit_check1; 
    $zeit_s = microtime(TRUE); 
    $length1 = count($terms_in_article1); // number of words 
    $length2 = count($terms_in_article2); // number of words 
    $all_terms = array_merge($terms_in_article1, $terms_in_article2); 
    $all_terms = array_unique($all_terms); 
    foreach ($all_terms as $all_termsa) { 
     $term_vector1[$all_termsa] = 0; 
     $term_vector2[$all_termsa] = 0; 
    } 
    foreach ($terms_in_article1 as $terms_in_article1a) { 
     $term_vector1[$terms_in_article1a]++; 
    } 
    foreach ($terms_in_article2 as $terms_in_article2a) { 
     $term_vector2[$terms_in_article2a]++; 
    } 
    $score = 0; 
    foreach ($all_terms as $all_termsa) { 
     $score += $term_vector1[$all_termsa]*$term_vector2[$all_termsa]; 
    } 
    $score = $score/($length1*$length2); 
    $score *= 500; // for better readability 
    $zeit_e = microtime(TRUE); 
    $zeit_check1 += ($zeit_e-$zeit_s); 
    return $score; 
} 
function check2($terms_in_article1, $terms_in_article2) { 
    global $zeit_check2; 
    $zeit_s = microtime(TRUE); 
    $length1 = count($terms_in_article1); // number of words 
    $length2 = count($terms_in_article2); // number of words 
    $score_table = array(); 
    foreach($terms_in_article1 as $term){ 
     if(!isset($score_table[$term])) $score_table[$term] = 0; 
     $score_table[$term] += 1; 
    } 
    $score_table2 = array(); 
    foreach($terms_in_article2 as $term){ 
     if(isset($score_table[$term])){ 
      if(!isset($score_table2[$term])) $score_table2[$term] = 0; 
      $score_table2[$term] += 1; 
     } 
    } 
    $score = 0; 
    foreach($score_table2 as $key => $entry){ 
     $score += $score_table[$key] * $entry; 
    } 
    $score = $score/($length1*$length2); 
    $score *= 500; 
    $zeit_e = microtime(TRUE); 
    $zeit_check2 += ($zeit_e-$zeit_s); 
    return $score; 
} 

I hope you can help me. Thanks in advance!

+0

I'm glad you found someone who could explain it :) werner's answer is nice too! Take care! – 0scar 2009-06-01 14:08:29

Answers

3

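Both functions implement pretty much the same algorithm, but the first does it in a straightforward way, while the second is a bit more clever and skips a portion of unnecessary work.

check1 works like this: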

// loop length(words1) times 
for each word in words1: 
    freq1[word]++ 

// loop length(words2) times 
for each word in words2: 
    freq2[word]++ 

// loop length(union(words1, words2)) times 
for each word in union(words1, words2): 
    score += freq1[word] * freq2[word] 

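But remember: when you multiply something by zero, you get zero.

This means that counting the frequencies of words that do not occur in both texts is a waste of time. For such a word, one of the two frequencies is zero, so the product is zero and adds nothing to the score.

check2 takes this into account: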
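As a small illustration of the point above, here is a sketch with made-up sample words (not taken from the question). It computes the per-word products over the union of both texts, so you can see which words actually contribute. It assumes PHP 7 or later for the `??` operator.

```php
<?php
// Hypothetical sample data, chosen only for illustration.
$words1 = ['apple', 'pear', 'apple'];
$words2 = ['apple', 'plum'];

$freq1 = array_count_values($words1); // ['apple' => 2, 'pear' => 1]
$freq2 = array_count_values($words2); // ['apple' => 1, 'plum' => 1]

$products = [];
foreach (array_unique(array_merge($words1, $words2)) as $word) {
    // A word that is missing from one text has frequency 0 there.
    $products[$word] = ($freq1[$word] ?? 0) * ($freq2[$word] ?? 0);
}

// $products is ['apple' => 2, 'pear' => 0, 'plum' => 0]:
// only the shared word 'apple' adds anything to the sum.
$score = array_sum($products);
```

Here 'pear' and 'plum' each appear in only one text, so their products are zero and the sum is decided by 'apple' alone.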

// loop length(words1) times 
for each word in words1: 
    freq1[word]++ 

// loop length(words2) times 
for each word in words2: 
    if freq1[word] > 0: 
     freq2[word]++ 

// loop length(intersection(words1, words2)) times 
for each word in freq2: 
    score += freq1[word] * freq2[word] 
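The two pseudocode variants can be put side by side in PHP to check that they agree. This is only a sketch: the function names `dotOverUnion` and `dotOverIntersection` are made up for this example, and both return the raw sum of products without the length normalisation and the factor 500 that check1() and check2() apply afterwards.

```php
<?php
// check1-style: iterate over every word that occurs in either text.
function dotOverUnion(array $words1, array $words2): int
{
    $freq1 = array_count_values($words1);
    $freq2 = array_count_values($words2);
    $score = 0;
    // The array union operator (+) yields the keys of both frequency tables.
    foreach (array_keys($freq1 + $freq2) as $word) {
        $score += ($freq1[$word] ?? 0) * ($freq2[$word] ?? 0);
    }
    return $score;
}

// check2-style: count words of the second text only if the first text has them.
function dotOverIntersection(array $words1, array $words2): int
{
    $freq1 = array_count_values($words1);
    $freq2 = [];
    foreach ($words2 as $word) {
        if (isset($freq1[$word])) {
            $freq2[$word] = ($freq2[$word] ?? 0) + 1;
        }
    }
    $score = 0;
    foreach ($freq2 as $word => $count) {
        $score += $freq1[$word] * $count;
    }
    return $score;
}

// Hypothetical sample texts; the shared words are 'the' (2 x 2), 'sat' and 'on'.
$a = ['the', 'cat', 'sat', 'on', 'the', 'mat'];
$b = ['the', 'dog', 'sat', 'on', 'the', 'log'];

$unionScore = dotOverUnion($a, $b);               // 6
$intersectionScore = dotOverIntersection($a, $b); // 6
```

The second function simply never visits 'cat', 'mat', 'dog' and 'log', which is where the saved time comes from.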
6

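Since you seem to care about performance, here is an optimized version of the algorithm in your check2 function. It relies more on built-in functions to improve speed.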

function check ($terms1, $terms2) 
{ 
    $counts1 = array_count_values($terms1); 
    $totalScore = 0; 
    foreach ($terms2 as $term) { 
     if (isset($counts1[$term])) $totalScore += $counts1[$term]; 
    } 
    return $totalScore * 500/(count($terms1) * count($terms2)); 
} 
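A short usage sketch may help show why this version needs no second frequency table. A word that occurs k times in $terms2 is visited k times by the loop and adds its count from $terms1 each time, which equals the product of the two frequencies. The function is repeated here only so that the snippet runs on its own, and the sample arrays are made up.

```php
<?php
// Same function as in the answer, repeated so the snippet is self-contained.
function check($terms1, $terms2)
{
    $counts1 = array_count_values($terms1);
    $totalScore = 0;
    foreach ($terms2 as $term) {
        if (isset($counts1[$term])) $totalScore += $counts1[$term];
    }
    return $totalScore * 500 / (count($terms1) * count($terms2));
}

// Hypothetical sample data: 'a' occurs once in the first text and twice in the second.
$terms1 = ['a', 'b'];
$terms2 = ['a', 'a', 'b', 'c', 'd'];

// Loop contributions: 'a' -> 1, 'a' -> 1, 'b' -> 1, 'c' and 'd' -> nothing.
// Total 3, then 3 * 500 / (2 * 5) = 150.
$score = check($terms1, $terms2);
```

Like the original functions, this divides by the product of the two lengths, so an empty input array would need a guard before the division.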
+0

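Thank you very much, Werner. You're right, your version is faster than the faster of my two versions. Unfortunately, my question was why the other function works as well, so I have to accept Rene Saarsoo's answer, since he answered my question perfectly. But you helped me a lot, too. Thanks! :) – caw 2009-05-31 12:21:32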
