Searching a Chinese text corpus for sentences that contain only certain characters

Goal: search an array of tens of thousands of Chinese sentences for the sentences that consist exclusively of characters from an array of "known characters".

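Example: say my corpus consists of the following sentences: 1) 我去中國 2) 妳愛他 3) 你在哪裏？ I only "know", or only want, sentences that contain exclusively these characters: 1) 我 2) 中 3) 國 4) 你 5) 在 6) 去 7) 愛 8) 哪 9) 裏. The first sentence would be returned as a result because all of its characters are in my second array. The second sentence would be rejected because I did not ask for 妳 or 他. The third sentence would be returned as a result. Punctuation (and any alphanumeric characters) is ignored.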

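I have a working script that does this (below). I would like to know whether it is an efficient way to do it. If you are interested, please take a look and suggest modifications, write your own, or offer some advice. I picked up a few ideas from this script and looked through some Stack Overflow questions, but they did not address this situation.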

<?php 
$known_characters = parse_file("FILENAME"); // retrieves target characters 
$sentences = parse_csv("FILENAME"); // retrieves the text corpus 

$number_wanted = 30; // number of sentences to attempt to retrieve 

$found = array(); // stores results 
$number_found = 0; // number of results 
$character_known = false; // assume character is not known 
$sentence_known = true; // assume sentence matches target characters 

foreach ($sentences as $s) {

    // split the sentence into an array of characters
    $sentence_characters = mb_str_split($s->ttext);

    foreach ($sentence_characters as $sc) {
        // check to see if the character is alpha-numeric or punctuation
        // if so, then ignore.
        $pattern = '/[a-zA-Z0-9\s\x{3000}-\x{303F}\x{FF00}-\x{FF5A}]/u';
        if (!preg_match($pattern, $sc)) {
            foreach ($known_characters as $kc) {
                if ($sc == $kc) {
                    // if character is known, move to next character
                    $character_known = true;
                    break;
                }
            }
        } else {
            // character is known if it is alpha-numeric or punctuation
            $character_known = true;
        }
        if (!$character_known) {
            // if character is unknown, move to next sentence
            $sentence_known = false;
            break;
        }
        $character_known = false; // reset for next iteration
    }
    if ($sentence_known) {
        // if sentence is known, add it to results array
        $found[] = $s->ttext;
        $number_found = $number_found + 1;
    }
    if ($number_found == $number_wanted) {
        break; // if required number of results are found, break
    }

    $sentence_known = true; // reset for next iteration
}
?>


2012-04-20 tsroten

It seems to me this should do it:

$pattern = '/[^a-zA-Z0-9\s\x{3000}-\x{303F}\x{FF00}-\x{FF5A}我中國你在去愛哪裏]/u'; 
if (preg_match($pattern, $sentence)) { 
    // the sentence contains characters besides a-zA-Z0-9, punctuation 
    // and the selected characters 
} else { 
    // the sentence contains only the allowed characters 
}

Be sure to save your source code file as UTF-8.
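As a quick check against the example in the question, a minimal sketch that applies this pattern to the three sample sentences might look like the following. The function name `filter_known_sentences` and its `$limit` parameter are made up for illustration and are not part of the answer above; the sketch also assumes plain strings rather than the `$s->ttext` objects used in the question.

```php
<?php
// Hypothetical helper (not from the answer): keeps the sentences that contain
// only allowed characters, stopping once $limit sentences have been collected.
function filter_known_sentences(array $sentences, string $pattern, int $limit): array
{
    $found = [];
    foreach ($sentences as $sentence) {
        if (count($found) >= $limit) {
            break; // enough results collected
        }
        // preg_match() returns 0 when no disallowed character is present
        if (preg_match($pattern, $sentence) === 0) {
            $found[] = $sentence;
        }
    }
    return $found;
}

// Negated character class: matches any character that is NOT allowed
$pattern = '/[^a-zA-Z0-9\s\x{3000}-\x{303F}\x{FF00}-\x{FF5A}我中國你在去愛哪裏]/u';
$sentences = ['我去中國', '妳愛他', '你在哪裏？'];

$result = filter_known_sentences($sentences, $pattern, 30);
print_r($result); // the first and third sentences are kept
```

The full-width question mark in the third sentence falls inside the `\x{FF00}-\x{FF5A}` range, so it is ignored in the same way as the other punctuation.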


2012-04-20 12:58:46 deceze

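Nice, I appreciate the simplicity. Is there a point at which a regular expression gets too long? For example, if I am searching for sentences that contain only characters from a set of 2000 different characters, would that be pushing it? – tsroten 2012-04-20 14:36:30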

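Technically, that should work fine, and it is probably better than repeatedly looping over 2000 characters. You probably do not want to store a literal regex for that, but you can build it dynamically. – deceze 2012-04-20 14:44:48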
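One way to build the pattern dynamically from the array of known characters is sketched below. The helper name `build_unknown_char_pattern` is hypothetical, and the sketch assumes `$known_characters` is a flat array of single-character UTF-8 strings, as in the question. `preg_quote()` is applied to each entry so that a known character that happens to be a regex metacharacter (for example `]` or `-`) cannot break the character class.

```php
<?php
// Hypothetical helper: builds a pattern that matches any character which is
// neither ignorable (alphanumeric, whitespace, punctuation ranges) nor known.
function build_unknown_char_pattern(array $known_characters): string
{
    // Escape regex metacharacters and the "/" delimiter in each entry
    $escaped = array_map(
        function ($c) {
            return preg_quote($c, '/');
        },
        $known_characters
    );

    return '/[^a-zA-Z0-9\s\x{3000}-\x{303F}\x{FF00}-\x{FF5A}'
        . implode('', $escaped)
        . ']/u';
}

$known_characters = ['我', '中', '國', '你', '在', '去', '愛', '哪', '裏'];
$pattern = build_unknown_char_pattern($known_characters);

var_dump(preg_match($pattern, '我去中國')); // int(0): only known characters
var_dump(preg_match($pattern, '妳愛他'));   // int(1): contains an unknown character
```

Building the pattern once, before looping over the sentences, means each sentence costs a single `preg_match()` call regardless of how many known characters there are.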

Awesome, thanks for your answer, it works well. I am new to regular expressions, so I had no idea what they were capable of. – tsroten 2012-04-20 15:26:15
