PHP: extract the similar parts from multiple strings

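The purpose of this is to extract the title of a book from multiple OCRings of its title page.

This only applies to the beginning of the strings; the ends of the strings do not need to be trimmed and can stay as they are.

For example, my strings might be: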

$title[0]='the history of the internet, expanded and revised'; 
$title[1]='the history of the internet'; 
$title[2]='published by xyz publisher the historv of the internot, expanded and'; 
$title[3]='history of the internet';

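So basically I want to trim each string so that it starts at the most probable starting point. Since there may be OCR errors (e.g. "historv", "internot"), I think it is best to take the number of characters in each word, which gives each string an array (so a multidimensional array overall) holding the length of each of its words. This can then be used to find runs that match and to trim the start of each string to the most probable one.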
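A minimal sketch of the word-length idea, assuming that punctuation should not count towards a word's length (so that "internet," and "internet" both come out as 8). The function name `wordLengths` is made up for illustration:

```php
<?php
// Hypothetical helper: turn a string into the list of its word lengths.
// Assumption: only letters and digits count, so trailing commas are ignored.
function wordLengths(string $s): array
{
    // Split on any run of whitespace and drop empty pieces.
    $words = preg_split('/\s+/', trim($s), -1, PREG_SPLIT_NO_EMPTY);

    return array_map(
        fn($w) => strlen(preg_replace('/[^A-Za-z0-9]/', '', $w)),
        $words
    );
}

print_r(wordLengths('history of the internet'));
```

For 'history of the internet' this gives the lengths 7, 2, 3, 8.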

The strings should be trimmed to:

$title[0]='the history of the internet, expanded and revised'; 
$title[1]='the history of the internet'; 
$title[2]='the historv of the internot, expanded and'; 
$title[3]='XXX history of the internet';

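So I need to be able to recognize that "history of the internet" (7 2 3 8) is the run that matches in all of the strings, and that the preceding "the" is most probably correct because it occurs in more than 50% of the strings. The start of every string is therefore trimmed to "the", and a placeholder of the same length is added to the string that is missing "the".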

So far I have:

function CompareSimilarStrings($array)
{
    $n = count($array);
    $len = array();

    // Get the length of each word in each string
    for ($run = 0; $run < $n; $run++) {
        $temp = explode(' ', $array[$run]);
        foreach ($temp as $key => $val) {
            $len[$run][$key] = strlen($val);
        }
    }

    for ($run = 0; $run < $n; $run++) {
        // stuck here: find the matching runs
    }
}

As you can see, I'm stuck on finding the matching runs.
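One possible way to find the runs, sketched under several assumptions: word lengths must match exactly (a word that gained or lost a letter would break the run), punctuation is ignored when counting, the longest shared run is searched by brute force starting from the first string, and the majority vote is taken on the length of the word in front of the run. All function names here are hypothetical.

```php
<?php
// Split a string into words on whitespace.
function splitWords(string $s): array
{
    return preg_split('/\s+/', trim($s), -1, PREG_SPLIT_NO_EMPTY);
}

// Letter/digit count of each word, so "internet," counts as 8.
function lengthsOf(array $words): array
{
    return array_map(
        fn($w) => strlen(preg_replace('/[^A-Za-z0-9]/', '', $w)),
        $words
    );
}

// Index where $needle occurs as a contiguous run in $haystack, or -1.
function findRun(array $haystack, array $needle): int
{
    $n = count($needle);
    for ($i = 0; $i + $n <= count($haystack); $i++) {
        if (array_slice($haystack, $i, $n) === $needle) {
            return $i;
        }
    }
    return -1;
}

function trimToCommonStart(array $titles): array
{
    $titles = array_values($titles);
    if ($titles === []) {
        return [];
    }
    $words = array_map('splitWords', $titles);
    $lens  = array_map('lengthsOf', $words);

    // Longest run of word lengths from the first string that occurs in every string.
    $best  = [];
    $first = $lens[0];
    for ($i = 0; $i < count($first); $i++) {
        for ($n = count($first) - $i; $n > count($best); $n--) {
            $run   = array_slice($first, $i, $n);
            $inAll = true;
            foreach ($lens as $l) {
                if (findRun($l, $run) < 0) { $inAll = false; break; }
            }
            if ($inAll) { $best = $run; break; }
        }
    }
    if ($best === []) {
        return $titles; // nothing in common; leave the strings alone
    }

    // Start of the run in each string, and the length of the word before it.
    $starts = [];
    $prev   = [];
    foreach ($lens as $k => $l) {
        $starts[$k] = findRun($l, $best);
        if ($starts[$k] > 0) {
            $prev[] = $l[$starts[$k] - 1];
        }
    }

    // Keep the preceding word only if one length occurs in more than half the strings.
    $votes = array_count_values($prev);
    arsort($votes);
    $len  = array_key_first($votes);
    $keep = $len !== null && $len > 0 && $votes[$len] * 2 > count($titles);

    $out = [];
    foreach ($words as $k => $w) {
        $s    = $starts[$k];
        $tail = array_slice($w, $s);
        if ($keep) {
            // Use the string's own word if it has the majority length, else a placeholder.
            $hasIt = $s > 0 && $lens[$k][$s - 1] === $len;
            array_unshift($tail, $hasIt ? $w[$s - 1] : str_repeat('X', $len));
        }
        $out[$k] = implode(' ', $tail);
    }
    return $out;
}
```

Run on the four example strings, this finds the run 7 2 3 8, keeps the three-letter word in front of it, and gives the fourth string an 'XXX' placeholder. To tolerate an added or dropped letter, the exact comparison in `findRun` would have to be loosened, for example to allow a difference of one per word.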

Any ideas?


2012-02-24 Alasdair

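Isn't it possible for the OCR to miss a short word, or to read a letter as a symbol? These "matching runs" don't seem to allow for that possibility. – erisco 2012-02-24 05:09:03

It won't miss any words. It will often get letters wrong, but that's why I want to use the number of letters in each word. Sometimes it will add or drop a letter, but the script will still match on the strings that are fine. – Alasdair 2012-02-24 05:12:37

I also want to ask: why isn't the title "the history of the internet, expanded and"? It matches 50% of the samples well, and a large subset of it matches the remaining cases. Is there any guarantee that every sample contains the complete title? That is the only clear rule I can think of that would invalidate this answer. – erisco 2012-02-24 05:15:21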

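You should look at the Smith-Waterman algorithm for local sequence alignment. It is a dynamic programming algorithm that finds the parts of two strings that are similar, in the sense that they have a low edit distance.

So, if you want to try it out, here is a PHP implementation of the algorithm.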
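For readers who just want to see the shape of the algorithm, here is a minimal character-level sketch. It is not the implementation linked above, and the scoring values (match +2, mismatch -1, gap -1) are arbitrary assumptions chosen for illustration.

```php
<?php
// Minimal Smith-Waterman sketch: returns the best local alignment score and
// the substrings of $a and $b that the alignment covers.
function smithWaterman(
    string $a,
    string $b,
    int $match = 2,
    int $mismatch = -1,
    int $gap = -1
): array {
    $n = strlen($a);
    $m = strlen($b);

    // $H[$i][$j] = best score of a local alignment ending at $a[$i-1], $b[$j-1].
    $H = array_fill(0, $n + 1, array_fill(0, $m + 1, 0));
    $best = 0;
    $bi = 0;
    $bj = 0;

    for ($i = 1; $i <= $n; $i++) {
        for ($j = 1; $j <= $m; $j++) {
            $sub  = $a[$i - 1] === $b[$j - 1] ? $match : $mismatch;
            $cell = max(
                0,                        // start a new alignment here
                $H[$i - 1][$j - 1] + $sub, // align the two characters
                $H[$i - 1][$j] + $gap,     // gap in $b
                $H[$i][$j - 1] + $gap      // gap in $a
            );
            $H[$i][$j] = $cell;
            if ($cell > $best) {
                $best = $cell;
                $bi = $i;
                $bj = $j;
            }
        }
    }

    // Trace back from the best cell until the score drops to zero.
    $i = $bi;
    $j = $bj;
    while ($i > 0 && $j > 0 && $H[$i][$j] > 0) {
        $sub = $a[$i - 1] === $b[$j - 1] ? $match : $mismatch;
        if ($H[$i][$j] === $H[$i - 1][$j - 1] + $sub) {
            $i--;
            $j--;
        } elseif ($H[$i][$j] === $H[$i - 1][$j] + $gap) {
            $i--;
        } else {
            $j--;
        }
    }

    return [
        'score' => $best,
        'a'     => substr($a, $i, $bi - $i),
        'b'     => substr($b, $j, $bj - $j),
    ];
}

$r = smithWaterman(
    'the history of the internet',
    'published by xyz publisher the historv of the internot, expanded and'
);
echo $r['b'], "\n";
```

With these scores, aligning the clean title against the noisy OCR string picks out 'the historv of the internot' from the second string, which is the kind of trimming the question asks for. The table uses memory proportional to the product of the two string lengths, which should be fine for title-page text.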


2012-02-24 15:07:18 gintas

Very interesting link, thanks. – Benj 2012-11-29 15:03:49
