Detecting non-English characters in a string

To help fight some spam, I'm looking for a way to find out whether a string contains any Chinese or Cyrillic characters.

I've looked at the character ranges in UTF-8 at http://en.wikipedia.org/wiki/UTF-8, but I can't figure out how to work with them in PHP.

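What I'd really like to do is count the number of characters that fall in the Cyrillic or Chinese ranges. Can this be done with a regular expression?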

Source

2012-10-29 Steven De Groote

Have a look at this: http://www.regular-expressions.info/unicode.html. If you provide some sample input, I can test a few things and possibly post an answer.

You can check whether the value (code point) of each character falls within a specific Unicode range. Here is a list of Unicode ranges: http://jrgraphix.net/research/unicode_blocks.php
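As a rough sketch of that idea (not the answerer's own code), the snippet below splits a UTF-8 string into characters and counts those whose code point lies in a given list of ranges. The helper name `count_in_ranges` is hypothetical, the block boundaries are taken from the Unicode block list, and `mb_ord` assumes PHP 7.2+ with the mbstring extension.

```php
<?php
// Hypothetical helper: counts the characters of a UTF-8 string whose
// code point lies in any of the given [low, high] ranges (inclusive).
function count_in_ranges(string $text, array $ranges): int
{
    // Split the string into individual UTF-8 characters.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return 0; // The input was not valid UTF-8.
    }

    $count = 0;
    foreach ($chars as $char) {
        $codePoint = mb_ord($char, 'UTF-8');
        foreach ($ranges as [$low, $high]) {
            if ($codePoint >= $low && $codePoint <= $high) {
                $count++;
                break; // Count each character at most once.
            }
        }
    }
    return $count;
}

// Block boundaries assumed from the Unicode block list.
$cyrillic = [[0x0400, 0x04FF], [0x0500, 0x052F]]; // Cyrillic, Cyrillic Supplement
$cjk      = [[0x4E00, 0x9FFF]];                   // CJK Unified Ideographs

echo count_in_ranges('Hello Привет', $cyrillic), "\n"; // 6
echo count_in_ranges('Hello 你好', $cjk), "\n";         // 2
```

The CJK range here covers only the main ideograph block; extension blocks would need extra entries in the list.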

Source

2012-10-29 11:14:30 muehlbau

Thanks a lot! I also found this, which creates regex matchers for some ranges: http://kourge.net/projects/regexp-unicode-block

Cool, thanks for the link. That could be very useful ;) – muehlbau

You can easily check whether a string is valid UTF-8 by using something like this:

mb_check_encoding($inputString, "UTF-8");

Just be aware that it seems to have been buggy from PHP 5.2.0 to 5.2.6.
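A minimal illustration of what that call reports, assuming the mbstring extension is loaded and the source file itself is saved as UTF-8. Note that it only validates the encoding; it says nothing about which scripts the string contains.

```php
<?php
// Valid UTF-8: Cyrillic text stored as UTF-8 bytes.
var_dump(mb_check_encoding('Привет', 'UTF-8')); // bool(true)

// Invalid UTF-8: 0xC3 opens a two-byte sequence, but 0x28 is not a
// continuation byte (those must be in the range 0x80-0xBF).
var_dump(mb_check_encoding("\xC3\x28", 'UTF-8')); // bool(false)
```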

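You may also find what you're after on the documentation page for mb_check_encoding, particularly in the comments. Adapting the answer by javalc6 at gmail dot com to your case: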

function check_utf8($str) {
    $count = 0; // Number of invalid UTF-8 bytes found
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $c = ord($str[$i]);
        if ($c >= 128) { // Non-ASCII byte: must start a multi-byte sequence
            $bytes = 0;
            if ($c > 247) { // 0xF8-0xFF never occur in UTF-8
                ++$count;
                continue;
            } elseif ($c > 239) {
                $bytes = 4;
            } elseif ($c > 223) {
                $bytes = 3;
            } elseif ($c > 191) {
                $bytes = 2;
            } else { // 0x80-0xBF: continuation byte without a lead byte
                ++$count;
                continue;
            }
            if (($i + $bytes) > $len) { // Sequence is cut off by the end of the string
                ++$count;
                continue;
            }
            while ($bytes > 1) { // Each remaining byte must be in 0x80-0xBF
                $i++;
                $b = ord($str[$i]);
                if ($b < 128 || $b > 191) {
                    ++$count;
                }
                $bytes--;
            }
        }
    }
    return $count;
}

Although, honestly, I haven't tested it.

Source

2012-10-29 11:27:20 Warpten

In PHP, preg_match_all returns the number of full pattern matches.

Try

$n = preg_match_all('/\p{Cyrillic}/u', $text);

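or, to match by code point range instead (PCRE, which PHP uses, does not support the Perl-style \p{In...} block names, so the Cyrillic and Cyrillic Supplement blocks are spelled out):

$n = preg_match_all('/[\x{0400}-\x{04FF}\x{0500}-\x{052F}]/u', $text);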

For more information on using Unicode in regular expressions, read this article.
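Putting this together for the spam use case in the question, here is a minimal sketch rather than a tested filter. The function names `count_scripts` and `looks_foreign` and the 0.3 threshold are illustrative assumptions; `\p{Han}` is the PCRE script name covering Chinese ideographs.

```php
<?php
// Hypothetical helper: counts Cyrillic and Han characters in a UTF-8 string.
function count_scripts(string $text): array
{
    // preg_match_all returns false on error (e.g. invalid UTF-8 input).
    $cyrillic = preg_match_all('/\p{Cyrillic}/u', $text);
    $han      = preg_match_all('/\p{Han}/u', $text);

    return [
        'cyrillic' => $cyrillic === false ? 0 : $cyrillic,
        'han'      => $han === false ? 0 : $han,
        'total'    => mb_strlen($text, 'UTF-8'),
    ];
}

// Hypothetical heuristic: flag text when the share of Cyrillic or Han
// characters exceeds a threshold. The default of 0.3 is an arbitrary assumption.
function looks_foreign(string $text, float $threshold = 0.3): bool
{
    $counts = count_scripts($text);
    if ($counts['total'] === 0) {
        return false; // An empty string has nothing to flag.
    }
    return ($counts['cyrillic'] + $counts['han']) / $counts['total'] > $threshold;
}

print_r(count_scripts('Привет мир'));          // cyrillic: 9, han: 0, total: 10
var_dump(looks_foreign('Hello world'));        // bool(false)
var_dump(looks_foreign('你好世界 hello'));      // bool(true), 4 of 10 characters
```

Using a ratio rather than a raw count avoids flagging a long English message that merely quotes a word or two in another script.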

Source

2012-10-29 12:20:29
