How can I check whether a text is contained in another one?

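I'm developing a document system that, every time a new document is created, has to detect and discard duplicates in a database of about 500,000 records. For now, I use a search engine to retrieve the 20 most similar documents and compare them with the new document we are trying to create. The problem is that I have to check whether the new document is similar to one of them (easy with similar_text), or even whether it is contained inside another text, and all of this has to allow for the text having been partially changed by the user (this is the hard part). How can I do that? How can I check whether a text is contained in another one?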

For example:

<?php

$new = "the wild lion";

$candidates = array(
    // $new is contained in this one, but 'wild' was changed to 'dangerous';
    // it has to be detected as a duplicate
    'the dangerous lion lives in Africa',
    'rhinoceros are native to Africa and three to southern Asia.'
);

foreach ($candidates as $candidate) {
    // is_similar_or_contained() is a placeholder for the check being asked about:
    // "$candidate is similar to $new, or $new is contained in it"
    if (is_similar_or_contained($new, $candidate)) {
        // Duplicated!!
    }
}

Of course, the documents in my system are all longer than 3 words :)

Source

2012-06-16 Javier Marín

This is the solution I'm using for the time being:

function contained($text1, $text2, $factor = 0.9) {
    // Split into lowercase words on whitespace and on leading, trailing or adjacent punctuation
    $pattern = '/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u';
    $words1 = preg_split($pattern, mb_strtolower($text1), -1, PREG_SPLIT_NO_EMPTY);
    $words2 = preg_split($pattern, mb_strtolower($text2), -1, PREG_SPLIT_NO_EMPTY);

    // Set long and short text
    if (count($words1) > count($words2)) {
        $long = $words1;
        $short = $words2;
    } else {
        $long = $words2;
        $short = $words1;
    }

    // An empty text is never reported as contained (this also avoids dividing by zero below)
    if (count($short) === 0) {
        return false;
    }

    // Count the words of the short text that also appear in the long one
    $count = 0;
    foreach ($short as $word) {
        if (in_array($word, $long)) {
            $count++;
        }
    }

    return ($count / count($short)) > $factor;
}

Source

2012-06-22 19:46:46

Some ideas that you could take on or investigate further:

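Index the documents and then search for similar ones. Open-source indexing/search systems such as Solr, Sphinx or Zend Search Lucene could come in handy here. You could also use the simhash algorithm or shingling. In short, simhash lets you compute similar hash values for similar documents, so you can store that value alongside each document and use it to check how similar two documents are.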
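As a rough illustration of the two ideas above, here is a minimal sketch, not a production implementation. All function names (`tokenize_words`, `shingles`, `shingle_containment`, `simhash32`, `hamming_distance`) are made up for this example, and the fingerprint is only 32 bits wide because it is built on PHP's `crc32()`; real systems typically use wider hashes and weighted features.

```php
<?php

// Lowercase the text and split it into words on whitespace and punctuation.
function tokenize_words(string $text): array {
    return preg_split('/[\s\p{P}]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

// Build the set of word n-grams ("shingles") of a text, stored as array keys.
function shingles(string $text, int $n = 2): array {
    $words = tokenize_words($text);
    $set = [];
    for ($i = 0; $i + $n <= count($words); $i++) {
        $set[implode(' ', array_slice($words, $i, $n))] = true;
    }
    return $set;
}

// Fraction of the smaller shingle set that also occurs in the larger one.
function shingle_containment(string $a, string $b, int $n = 2): float {
    $sa = shingles($a, $n);
    $sb = shingles($b, $n);
    if (count($sa) > count($sb)) {
        [$sa, $sb] = [$sb, $sa]; // make $sa the smaller set
    }
    if (count($sa) === 0) {
        return 0.0;
    }
    return count(array_intersect_key($sa, $sb)) / count($sa);
}

// 32-bit simhash: every feature votes +1 or -1 on each bit position,
// depending on the corresponding bit of its crc32 hash.
function simhash32(array $features): int {
    $votes = array_fill(0, 32, 0);
    foreach ($features as $feature) {
        $hash = crc32((string) $feature);
        for ($bit = 0; $bit < 32; $bit++) {
            $votes[$bit] += (($hash >> $bit) & 1) ? 1 : -1;
        }
    }
    $fingerprint = 0;
    foreach ($votes as $bit => $vote) {
        if ($vote > 0) {
            $fingerprint |= (1 << $bit);
        }
    }
    return $fingerprint;
}

// Number of differing bits between two fingerprints (smaller = more similar).
function hamming_distance(int $a, int $b): int {
    return substr_count(decbin($a ^ $b), '1');
}

// Example: containment of the new text in a candidate, and fingerprint distance.
$new = 'the wild lion';
$candidate = 'the dangerous lion lives in Africa';
$ratio = shingle_containment($new, $candidate, 1);
$bits = hamming_distance(
    simhash32(array_keys(shingles($new, 1))),
    simhash32(array_keys(shingles($candidate, 1)))
);
```

With single-word shingles (`$n = 1`) the containment value is just a word-overlap ratio, which is 2/3 for the example from the question. With two-word shingles it drops to 0, because the one changed word breaks both shingles of the short text, so the shingle size has to be chosen with the expected amount of editing in mind.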

Other algorithms that you may find helpful to get some ideas from are:

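1. Levenshtein distance

2. Bayesian filtering - SO Questions re Bayesian filtering. The first link in this list item points to the Wikipedia article on Bayesian spam filtering, but the algorithm can be adapted to what you are trying to do.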
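One way Levenshtein distance might be adapted to the "contained in" case is to slide a window the size of the short text over the long one and keep the best match. The sketch below assumes word-sized steps and a window of exactly as many words as the short text; the function name `best_window_distance` is hypothetical. Note that PHP's `levenshtein()` works on bytes and, before PHP 8.0, returns -1 for arguments longer than 255 characters, so this only suits short texts on older versions.

```php
<?php

// Lowest normalised Levenshtein distance between $needle and any window of
// the same number of words inside $haystack.
// 0.0 means some window matches exactly; values near 1.0 mean no window is close.
function best_window_distance(string $needle, string $haystack): float {
    $needleWords = preg_split('/\s+/u', mb_strtolower(trim($needle)), -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $hayWords = preg_split('/\s+/u', mb_strtolower(trim($haystack)), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $size = count($needleWords);
    if ($size === 0 || count($hayWords) === 0) {
        return 1.0; // nothing to compare
    }

    $target = implode(' ', $needleWords);
    $best = 1.0;
    $lastStart = max(0, count($hayWords) - $size);

    for ($start = 0; $start <= $lastStart; $start++) {
        $window = implode(' ', array_slice($hayWords, $start, $size));
        // Normalise by the longer string so the result stays within 0..1.
        $distance = levenshtein($target, $window) / max(strlen($target), strlen($window));
        $best = min($best, $distance);
    }

    return $best;
}

// Example: flag a candidate when some window is "close enough".
// The 0.5 threshold is an arbitrary value chosen for this illustration.
$isDuplicate = best_window_distance('the wild lion', 'the dangerous lion lives in Africa') <= 0.5;
```

This costs one Levenshtein computation per window, which is probably acceptable when it only runs against the 20 candidates returned by the search engine rather than against the whole database.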

Source

2012-06-16 11:09:00 Haroon

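My problem is not finding similar documents (I'm already using an index to find them); it is checking whether a text is contained in another one. These algorithms only compare one whole text against another; they don't find which part of a text is most similar to the other text.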
