2011-02-23 73 views
7

PHP中的函數levenshtein對最大長度爲255的字符串起作用。什麼是計算PHP中句子相似度得分的好替代方法。PHP中的字符串相似度:用於長字符串的levenshtein like函數

基本上我有一個句子的數據庫,我想找到大概的重複。 similar_text函數沒有給我預期的結果。什麼是對我來說,檢測像下面相似的句子最簡單的方法:

$ss="Jack is a very nice boy, isn't he?"; 
$pp="jack is a very nice boy is he"; 

$ss=strtolower($ss); // convert to lower case as we dont care about case 
$pp=strtolower($pp); 

$score=similar_text($ss, $pp); 
echo "$score %\n"; // Outputs just 29 % 

$score=levenshtein ($ss, $pp); 
echo "$score\n"; // Outputs '5', which indicates they are very similar. But, it does not work for more than 255 chars :(
+0

注意:我不關心語義的含義。 – anon 2011-02-23 15:37:27

回答

9

levenshtein算法的O(n*m),其中nm是兩個輸入字符串的長度時的複雜性。這非常昂貴,計算長字符串的距離需要很長時間。

對於整個句子,你可能想使用一個diff算法代替,例如參見:Highlight the difference between two strings in PHP

話雖如此,PHP還提供了similar_text功能,有一個更糟糕的複雜性(O(max(n,m)**3)),但似乎工作在更長的琴絃上。

+0

謝謝。你能否告訴我,我是否錯誤地解釋了similar_text的結果?我希望能夠像在示例中那樣檢測類似的句子。 – anon 2011-02-23 15:36:54

+0

'similar_text'返回匹配字符的數量,而不是百分比。有第三個可選參數可用於查找百分比。 – 2011-02-23 15:51:03

+0

噢好吧..我想第三個參數就是如果你想通過引用傳遞結果變量:)。 – anon 2011-02-24 05:17:42

2

您可以嘗試使用similar_text
它可以變得非常慢,20,000個字符(3-5秒),但你的例子你只提到句子,這將適用於這種用法。

有一點需要注意的是,比較不同尺寸的字符串時,您將無法獲得100%的效果。例如,如果您將「他」與「頭」進行比較,則只會得到50%的匹配。

3

我發現Smith Waterman Gotoh是比較句子的最佳算法。更多信息in this answer。這裏是PHP代碼示例:

class SmithWatermanGotoh 
{ 
    private $gapValue; 
    private $substitution; 

    /** 
    * Constructs a new Smith Waterman metric. 
    * 
    * @param gapValue 
    *   a non-positive gap penalty 
    * @param substitution 
    *   a substitution function 
    */ 
    public function __construct($gapValue=-0.5, 
       $substitution=null) 
    { 
     if($gapValue > 0.0) throw new Exception("gapValue must be <= 0"); 
     //if(empty($substitution)) throw new Exception("substitution is required"); 
     if (empty($substitution)) $this->substitution = new SmithWatermanMatchMismatch(1.0, -2.0); 
     else $this->substitution = $substitution; 
     $this->gapValue = $gapValue; 
    } 

    public function compare($a, $b) 
    { 
     if (empty($a) && empty($b)) { 
      return 1.0; 
     } 

     if (empty($a) || empty($b)) { 
      return 0.0; 
     } 

     $maxDistance = min(mb_strlen($a), mb_strlen($b)) 
       * max($this->substitution->max(), $this->gapValue); 
     return $this->smithWatermanGotoh($a, $b)/$maxDistance; 
    } 

    private function smithWatermanGotoh($s, $t) 
    { 
     $v0 = []; 
     $v1 = []; 
     $t_len = mb_strlen($t); 
     $max = $v0[0] = max(0, $this->gapValue, $this->substitution->compare($s, 0, $t, 0)); 

     for ($j = 1; $j < $t_len; $j++) { 
      $v0[$j] = max(0, $v0[$j - 1] + $this->gapValue, 
        $this->substitution->compare($s, 0, $t, $j)); 

      $max = max($max, $v0[$j]); 
     } 

     // Find max 
     for ($i = 1; $i < mb_strlen($s); $i++) { 
      $v1[0] = max(0, $v0[0] + $this->gapValue, $this->substitution->compare($s, $i, $t, 0)); 

      $max = max($max, $v1[0]); 

      for ($j = 1; $j < $t_len; $j++) { 
       $v1[$j] = max(0, $v0[$j] + $this->gapValue, $v1[$j - 1] + $this->gapValue, 
         $v0[$j - 1] + $this->substitution->compare($s, $i, $t, $j)); 

       $max = max($max, $v1[$j]); 
      } 

      for ($j = 0; $j < $t_len; $j++) { 
       $v0[$j] = $v1[$j]; 
      } 
     } 

     return $max; 
    } 
} 

class SmithWatermanMatchMismatch 
{ 
    private $matchValue; 
    private $mismatchValue; 

    /** 
    * Constructs a new match-mismatch substitution function. When two 
    * characters are equal a score of <code>matchValue</code> is assigned. In 
    * case of a mismatch a score of <code>mismatchValue</code>. The 
    * <code>matchValue</code> must be strictly greater then 
    * <code>mismatchValue</code> 
    * 
    * @param matchValue 
    *   value when characters are equal 
    * @param mismatchValue 
    *   value when characters are not equal 
    */ 
    public function __construct($matchValue, $mismatchValue) { 
     if($matchValue <= $mismatchValue) throw new Exception("matchValue must be > matchValue"); 

     $this->matchValue = $matchValue; 
     $this->mismatchValue = $mismatchValue; 
    } 

    public function compare($a, $aIndex, $b, $bIndex) { 
     return ($a[$aIndex] === $b[$bIndex] ? $this->matchValue 
       : $this->mismatchValue); 
    } 

    public function max() { 
     return $this->matchValue; 
    } 

    public function min() { 
     return $this->mismatchValue; 
    } 
} 

$str1 = "Jack is a very nice boy, isn't he?"; 
$str2 = "jack is a very nice boy is he"; 
$o = new SmithWatermanGotoh(); 
echo $o->compare($str1, $str2);