PHP: 2 strings - which one is UTF-8 and which one is not?

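I have a database with a lot of strings. Some of them are correctly UTF-8 encoded and some are not. So I built a script that selects 100 strings from the db. The following function determines whether a string contains UTF-8 or not (regardless of whether it is correct):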

/**
 * Returns 1 if $text contains at least one well-formed multi-byte UTF-8
 * sequence and 0 if it contains none. It does not check that the whole
 * string is valid UTF-8.
 */
function detectUTF8($text) {
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}         # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )+%xs',
    $text);
}
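
To make the behaviour of detectUTF8() concrete, here is a small sketch with made-up sample strings (the $samples array and its labels are purely illustrative, they do not come from the database). The function is repeated so the snippet runs on its own. Because the pattern is not anchored, a string that mixes a valid sequence with a stray ISO-8859-1 byte still counts as "containing UTF-8".

```php
<?php
// Copy of detectUTF8() from the question, repeated so this snippet is self-contained.
function detectUTF8($text) {
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}         # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )+%xs', $text);
}

// Hypothetical sample data, written as raw bytes so the file encoding does not matter.
$samples = array(
    'plain ASCII'        => 'cafe',
    'UTF-8 e-acute'      => "caf\xC3\xA9",
    'ISO-8859-1 e-acute' => "caf\xE9",
    'valid + stray byte' => "caf\xC3\xA9 \xE9",
);

foreach ($samples as $label => $bytes) {
    echo $label, ': ', detectUTF8($bytes), "\n";
}
// plain ASCII: 0
// UTF-8 e-acute: 1
// ISO-8859-1 e-acute: 0
// valid + stray byte: 1
```

So a result of 1 only says that at least one well-formed multi-byte sequence is present, not that the string as a whole is correct.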

The output of the script is the strings that contain UTF-8 and, after a line break, the utf8_decode() version of each string. Since some strings are double-encoded, I decode all of the strings, as you can see there.
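
As a minimal sketch of what double encoding does at the byte level, take the single character "é" as an assumed example (it is not taken from the actual data). The byte values below are written out by hand. Note that utf8_decode() is deprecated as of PHP 8.2, so newer versions emit a deprecation notice. The preg_match('//u', ...) call is used here only as a quick validity check, because the u modifier makes the match fail on malformed UTF-8.

```php
<?php
// "é" as correct UTF-8, and the same character after being UTF-8-encoded twice
// (the bytes C3 A9 were treated as ISO-8859-1 and encoded a second time).
$correct       = "\xC3\xA9";
$doubleEncoded = "\xC3\x83\xC2\xA9";

// utf8_decode() converts UTF-8 to ISO-8859-1.
$decodedCorrect = utf8_decode($correct);       // one byte: E9 (ISO-8859-1 "é")
$decodedDouble  = utf8_decode($doubleEncoded); // two bytes: C3 A9 (UTF-8 "é")

echo bin2hex($decodedCorrect), "\n"; // e9
echo bin2hex($decodedDouble), "\n";  // c3a9

// With the u modifier, preg_match() returns false when the subject is not valid UTF-8.
var_dump(preg_match('//u', $decodedCorrect)); // bool(false)
var_dump(preg_match('//u', $decodedDouble));  // int(1)
```

In this toy case, decoding the double-encoded string yields valid UTF-8 again, while decoding the already correct string yields a lone ISO-8859-1 byte. Whether that observation holds for every string in a real database is an assumption that would need checking.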

The result is a list in which some entries have 2 strings each: one is correct, the other one is wrong. You can see it here. But how can I determine which one is the correct one?

I hope you can help me. Thanks in advance!


2009-06-12 caw

Wow! That is some serious-looking UTF-8 support. – 2009-06-12 20:43:22

Do you think that is a good thing or a bad thing? Do you have better code? I got the code from http://www.unspecifiederror.net/2008/09/11/detecting-utf8-in-php-without-multibyte/ (thanks, miek). – caw 2009-06-12 22:11:24