Check whether a string is encoded as UTF-8

function seems_utf8($str) {
    $length = strlen($str);
    for ($i = 0; $i < $length; $i++) {
        $c = ord($str[$i]);
        if ($c < 0x80) $n = 0; # 0bbbbbbb
        elseif (($c & 0xE0) == 0xC0) $n = 1; # 110bbbbb
        elseif (($c & 0xF0) == 0xE0) $n = 2; # 1110bbbb
        elseif (($c & 0xF8) == 0xF0) $n = 3; # 11110bbb
        elseif (($c & 0xFC) == 0xF8) $n = 4; # 111110bb
        elseif (($c & 0xFE) == 0xFC) $n = 5; # 1111110b
        else return false; # Does not match any model
        for ($j = 0; $j < $n; $j++) { # n bytes matching 10bbbbbb follow ?
            if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80))
                return false;
        }
    }
    return true;
}

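I got this code from WordPress. I don't know much about this topic, but I would like to know what exactly happens in this function.

If anyone knows, could you please help me?

I need a clear idea of what the code above does. A line-by-line explanation would be most helpful.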

2009-09-24 coderex

I use two methods to check whether a string is UTF-8 (depending on the case):

mb_internal_encoding('UTF-8'); // always needed before mb_ functions, check note below
if (mb_strlen($string) != strlen($string)) {
    // not single byte
}

- or -

if (preg_match('!\S!u', $string)) {
    // utf8
}

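About mb_internal_encoding: due to a PHP bug whose cause is unknown to me (versions before 5.3; I haven't tested it on 5.3), passing the encoding as a parameter to the mb_ functions doesn't work, so the internal encoding needs to be set before any use of mb_ functions.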
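One caveat on the first check: a pure-ASCII string is valid UTF-8 too, yet mb_strlen and strlen agree on it, so that comparison only detects multi-byte content. For the second approach, here is a minimal sketch, assuming the PCRE extension (bundled with PHP by default). With the u modifier, preg_match returns false when the subject is not valid UTF-8. The helper name is_valid_utf8_pcre is made up for illustration, and the empty pattern is swapped in for '!\S!u' so that empty and whitespace-only strings also count as valid.

```php
<?php
// Hypothetical helper (not from the answer above): relies on the PCRE "u"
// modifier, which makes preg_match() fail on subjects that are not valid UTF-8.
function is_valid_utf8_pcre(string $s): bool
{
    // The empty pattern matches any valid subject, so the result is 1 for
    // valid UTF-8 and false for invalid input.
    return preg_match('//u', $s) === 1;
}

var_dump(is_valid_utf8_pcre("abc"));         // plain ASCII: bool(true)
var_dump(is_valid_utf8_pcre("caf\xC3\xA9")); // "café" in UTF-8: bool(true)
var_dump(is_valid_utf8_pcre("caf\xE9"));     // "café" in ISO-8859-1: bool(false)
```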

2009-09-24 18:59:25 bisko

So just do mb_strlen($string, 'UTF-8') then. – 2015-07-31 19:49:53

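The algorithm basically checks whether the byte sequence conforms to the pattern that you can see in the Wikipedia article.

The for loop goes through all bytes in $str. ord returns the numeric value (0-255) of the current byte. This number is then tested for certain properties.

If the number is less than 128 (0x80), it is a single-byte character. If it is equal to or greater than 128, the length of the multi-byte character is determined. This can be read from the first byte of the multi-byte sequence: if the first byte has the form 110xxxxx, it is a two-byte character; 1110xxxx, a three-byte character; and so on.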
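To make the lead-byte rule concrete, here is a small sketch that pulls the if/elseif chain of seems_utf8 into a function of its own. The name continuation_count and the -1 return value are made up for illustration; they are not part of the WordPress code.

```php
<?php
// Hypothetical helper: returns how many continuation bytes a lead byte
// announces, mirroring the if/elseif chain in seems_utf8(), or -1 if the
// byte cannot start a sequence.
function continuation_count(int $c): int
{
    if ($c < 0x80) return 0;             // 0bbbbbbb: single byte (ASCII)
    if (($c & 0xE0) === 0xC0) return 1;  // 110bbbbb
    if (($c & 0xF0) === 0xE0) return 2;  // 1110bbbb
    if (($c & 0xF8) === 0xF0) return 3;  // 11110bbb
    if (($c & 0xFC) === 0xF8) return 4;  // 111110bb
    if (($c & 0xFE) === 0xFC) return 5;  // 1111110b
    return -1;                           // e.g. 10bbbbbb, a continuation byte
}

echo continuation_count(ord("A")), "\n"; // 0
echo continuation_count(0xC3), "\n";     // 1  (lead byte of "é", C3 A9)
echo continuation_count(0xE2), "\n";     // 2  (lead byte of "€", E2 82 AC)
echo continuation_count(0x80), "\n";     // -1
```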

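I think the most cryptic parts are expressions like ($c & 0xE0) == 0xC0. They check whether the number, written in binary, matches a specific pattern. I'll try to explain how this works using that example. Since every number we test against this pattern is equal to or greater than 0x80, its first bit is always 1, so the pattern is already restricted to 1xxxxxxx. If we then do a bitwise AND with 11100000 (0xE0), we get the following result: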

  1xxxxxxx
& 11100000
= 1xx00000

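So bits 5 and 6 of the result (counted from the right, with the index starting at 0) depend on what our current number is. For the result to equal 11000000, bit 5 must be 0 and bit 6 must be 1: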

  1xxxxxxx
& 11100000
≟ 11000000
   ↓↓
→ 110xxxxx

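This means the remaining bits of our number can be arbitrary: 110xxxxx. That is exactly the pattern the Wikipedia article gives for the first byte of a two-byte character.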
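The same masking step can be tried on a concrete byte. This is only an illustration; the sample values 0xC3 and 0xE9 and the variable name are my own choice.

```php
<?php
// Worked example for one sample byte. 0xC3 is the lead byte of "é" (C3 A9).
$c = 0xC3;

printf("byte   %08b\n", $c);        // 11000011
printf("mask   %08b\n", 0xE0);      // 11100000
printf("result %08b\n", $c & 0xE0); // 11000000

// The inner parentheses are required: in PHP, == binds more tightly than &,
// so $c & 0xE0 == 0xC0 would be parsed as $c & (0xE0 == 0xC0).
$is_two_byte_lead = ($c & 0xE0) == 0xC0;
var_dump($is_two_byte_lead); // bool(true)

// 0xE9 (11101001) has bit 5 set, so it does not match the two-byte pattern.
var_dump((0xE9 & 0xE0) == 0xC0); // bool(false)
```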

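Finally, the inner for loop checks the validity of the bytes that follow the first byte of a multi-byte character. These must all have the form 10xxxxxx.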
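As a sketch of that continuation-byte test in isolation (the helper name is_continuation_byte is hypothetical), masking with 0xC0 keeps only the top two bits, which must be exactly 10:

```php
<?php
// Hypothetical helper: true if the byte has the form 10bbbbbb.
function is_continuation_byte(int $c): bool
{
    return ($c & 0xC0) === 0x80;
}

$euro = "\xE2\x82\xAC"; // "€": one lead byte followed by two continuation bytes
var_dump(is_continuation_byte(ord($euro[1]))); // 0x82: bool(true)
var_dump(is_continuation_byte(ord($euro[2]))); // 0xAC: bool(true)
var_dump(is_continuation_byte(ord($euro[0]))); // 0xE2 is a lead byte: bool(false)
var_dump(is_continuation_byte(ord("A")));      // ASCII: bool(false)
```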

2009-09-24 19:40:54 Gumbo

If you know a little about UTF-8, this is a very simple implementation.

function seems_utf8($str) {
    # get length, for utf8 this means bytes and not characters
    $length = strlen($str);

    # we need to check each byte in the string
    for ($i = 0; $i < $length; $i++) {

        # get the byte code 0-255 of the i-th byte
        $c = ord($str[$i]);

        # In the original UTF-8 design a character takes 1-6 bytes
        # (RFC 3629 later restricted this to 4). How many exactly
        # is encoded in the first byte if it has a code >= 128
        # (highest bit set). For all <= 127 ASCII is the same as UTF-8.
        # The number of bytes per character is stored in
        # the highest bits of the first byte of the UTF-8
        # character. The bit pattern that must be matched
        # for each length is shown as a comment.
        #
        # So $n will hold the number of additional bytes

        if ($c < 0x80) $n = 0; # 0bbbbbbb
        elseif (($c & 0xE0) == 0xC0) $n = 1; # 110bbbbb
        elseif (($c & 0xF0) == 0xE0) $n = 2; # 1110bbbb
        elseif (($c & 0xF8) == 0xF0) $n = 3; # 11110bbb
        elseif (($c & 0xFC) == 0xF8) $n = 4; # 111110bb
        elseif (($c & 0xFE) == 0xFC) $n = 5; # 1111110b
        else return false; # Does not match any model

        # The code now checks the following additional bytes.
        # The first test in the if makes sure the byte is really inside
        # the string and we are not running past the string end.
        # The second test checks that the highest two bits of every
        # additional byte are 1 and 0 (masked value 0x80),
        # which is a requirement for all additional UTF-8 bytes.

        for ($j = 0; $j < $n; $j++) { # n bytes matching 10bbbbbb follow ?
            if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80))
                return false;
        }
    }
    return true;
}

By the way: in PHP I would assume this is a factor of 50-100 slower than a C function, so you shouldn't use it on long strings or in production systems.
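If the mbstring extension is available (an assumption on my part), mb_check_encoding does the validation in native code, so no byte loop in PHP is needed. A minimal sketch follows; note that this is a different check from seems_utf8, which only looks at bit patterns, so the two may disagree on some inputs.

```php
<?php
// Assumes the mbstring extension is loaded. mb_check_encoding() validates
// the whole string in native code, so no PHP-level byte loop is needed.
$samples = [
    "abc",          // plain ASCII
    "caf\xC3\xA9",  // "café" encoded as UTF-8
    "caf\xE9",      // "café" encoded as ISO-8859-1
];

foreach ($samples as $s) {
    // bin2hex() keeps the output printable even for invalid input.
    $verdict = mb_check_encoding($s, 'UTF-8') ? 'valid' : 'invalid';
    printf("%s => %s\n", bin2hex($s), $verdict);
}
```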

2009-09-24 19:41:17 Lothar

Stumbled upon this post while having a similar problem: mb_detect_encoding reported UTF-8, but mb_check_encoding returned false ...

My solution to that was:

$cur_encoding = mb_detect_encoding($in_str);
if ($cur_encoding == "UTF-8" && mb_check_encoding($in_str, "UTF-8"))
    return $in_str;
else
    return utf8_encode($in_str);

Got it from: http://board.phpbuilder.com/showthread.php?10368156-mb_check_encoding%28-in_str-quot-UTF-8-quot-%29-return-different-results

Sorry, I couldn't post a proper link ....
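For newer PHP versions, here is a hedged variant of the same idea. utf8_encode is deprecated as of PHP 8.2, and since it converts from ISO-8859-1 to UTF-8, mb_convert_encoding can do the same job. The function name ensure_utf8 is made up, the mbstring extension is assumed, and I dropped the mb_detect_encoding call because mb_check_encoding alone decides the outcome.

```php
<?php
// Hypothetical wrapper around the idea above; assumes mbstring is loaded
// and that any non-UTF-8 input is ISO-8859-1 (the same assumption that
// utf8_encode() makes).
function ensure_utf8(string $in_str): string
{
    if (mb_check_encoding($in_str, 'UTF-8')) {
        return $in_str; // already valid UTF-8, leave untouched
    }
    // utf8_encode() is deprecated as of PHP 8.2; this call performs the
    // same ISO-8859-1 to UTF-8 conversion.
    return mb_convert_encoding($in_str, 'UTF-8', 'ISO-8859-1');
}

echo bin2hex(ensure_utf8("caf\xE9")), "\n";     // 636166c3a9
echo bin2hex(ensure_utf8("caf\xC3\xA9")), "\n"; // 636166c3a9
```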

2014-10-23 07:48:16 womd
