What is the best way to split a string into an array of Unicode characters in PHP?

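In PHP < 6, what is the best way to split a string into an array of Unicode characters? What if the input is not necessarily UTF-8?

I want to know whether the set of Unicode characters in an input string is a subset of another set of Unicode characters.

Why not run straight for the mb_ family of functions, as the first couple of answers didn't?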

2009-09-08 joeforker

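You do realize that comparing Unicode characters is non-trivial, depending on the kind of comparison you want? For example, the same character can be written as U+00DC or as U+0055 U+0308. – derobert

Yes, I do realize that. If it becomes an issue, I will need to convert the input to one of the Unicode normalization forms before splitting. – joeforker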
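The normalization step mentioned in the comment above could be sketched roughly as follows. This assumes a modern PHP (7+) with the intl extension, whose `Normalizer` class provides `normalize()`; the helper name `normalize_nfc` is hypothetical, and the sketch simply returns the input unchanged when intl is not available.

```php
<?php
// Hypothetical helper: bring a UTF-8 string into NFC before splitting it.
// Assumption: the intl extension may or may not be loaded.
function normalize_nfc(string $s): string
{
    if (class_exists('Normalizer')) {
        $normalized = Normalizer::normalize($s, Normalizer::FORM_C);
        // normalize() returns false on failure (e.g. invalid UTF-8).
        return $normalized === false ? $s : $normalized;
    }
    return $s; // intl missing: leave the string as it is
}

$composed   = "\u{00DC}";   // single precomposed code point
$decomposed = "U\u{0308}";  // base letter followed by a combining diaeresis
var_dump(normalize_nfc($decomposed) === normalize_nfc($composed));
```

After normalization both spellings should split into the same single character, so a later set comparison treats them as equal.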

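You can use the 'u' modifier with PCRE regular expressions; see Pattern Modifiers (quoting):

u (PCRE_UTF8)

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.

For instance, consider the following code: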

header('Content-type: text/html; charset=UTF-8'); // So the browser doesn't make our lives harder 
$str = "abc 文字化け, efg"; 

$results = array(); 
preg_match_all('/./', $str, $results); 
var_dump($results[0]);

You will get an unusable result:

array 
    0 => string 'a' (length=1) 
    1 => string 'b' (length=1) 
    2 => string 'c' (length=1) 
    3 => string ' ' (length=1) 
    4 => string '�' (length=1) 
    5 => string '�' (length=1) 
    6 => string '�' (length=1) 
    7 => string '�' (length=1) 
    8 => string '�' (length=1) 
    9 => string '�' (length=1) 
    10 => string '�' (length=1) 
    11 => string '�' (length=1) 
    12 => string '�' (length=1) 
    13 => string '�' (length=1) 
    14 => string '�' (length=1) 
    15 => string '�' (length=1) 
    16 => string ',' (length=1) 
    17 => string ' ' (length=1) 
    18 => string 'e' (length=1) 
    19 => string 'f' (length=1) 
    20 => string 'g' (length=1)

But with this code:

header('Content-type: text/html; charset=UTF-8'); // So the browser doesn't make our lives harder 
$str = "abc 文字化け, efg"; 

$results = array(); 
preg_match_all('/./u', $str, $results); 
var_dump($results[0]);

(Notice the 'u' at the end of the regex.)

You get what you want:

array 
    0 => string 'a' (length=1) 
    1 => string 'b' (length=1) 
    2 => string 'c' (length=1) 
    3 => string ' ' (length=1) 
    4 => string '文' (length=3) 
    5 => string '字' (length=3) 
    6 => string '化' (length=3) 
    7 => string 'け' (length=3) 
    8 => string ',' (length=1) 
    9 => string ' ' (length=1) 
    10 => string 'e' (length=1) 
    11 => string 'f' (length=1) 
    12 => string 'g' (length=1)
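Two caveats are worth sketching here. With the 'u' modifier, `preg_match_all` returns false when the subject is not valid UTF-8, and '.' does not match newlines unless the 's' modifier is added. The helper name `split_utf8_chars` below is hypothetical, and the code assumes PHP 7.1+ for the nullable return type.

```php
<?php
// Hypothetical helper: split a UTF-8 string into characters, or return null
// when the input is not valid UTF-8.
function split_utf8_chars(string $str): ?array
{
    // 'u' treats pattern and subject as UTF-8; 's' lets '.' match "\n" too.
    $count = preg_match_all('/./us', $str, $matches);
    if ($count === false) {
        // preg_last_error() reports PREG_BAD_UTF8_ERROR in this situation.
        return null;
    }
    return $matches[0];
}

var_dump(split_utf8_chars("a\n文"));
var_dump(split_utf8_chars("\xff")); // a lone 0xFF byte is never valid UTF-8
```

Returning null rather than an empty array keeps "invalid input" distinguishable from "empty string".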

Hope this helps :-)

2009-09-08 21:39:21

+1 Nice detailed example! :) –

@Shadi Almosri: Thanks :-) –

Try this:

preg_match_all('/./u', $text, $array);

2009-09-08 21:35:05 JasonWoof

+1 That's clever! – Gumbo

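If for some reason the regex approach isn't enough for you: I once wrote Zend_Locale_UTF8, which might help you if you decide to do it yourself.

In particular, have a look at the class Zend_Locale_UTF8_PHP5_String, which reads Unicode strings and, in order to work with them, splits them into single characters (which obviously may consist of multiple bytes).

EDIT: I just realized that ZF's SVN browser is down, so I copied the important methods here for convenience: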

/** 
* Returns the UTF-8 code sequence as an array for any given $string. 
* 
* @access protected 
* @param string|integer $string 
* @return array 
*/ 
protected function _decode($string) { 

    $string  = (string) $string; 
    $length  = strlen($string); 
    $sequence = array(); 

    for ($i=0; $i<$length;) { 
     $bytes  = $this->_characterBytes($string, $i); 
     $ord  = $this->_ord($string, $bytes, $i); 

     if ($ord !== false) 
      $sequence[] = $ord; 

     if ($bytes === false) 
      $i++; 
     else 
      $i += $bytes; 
    } 

    return $sequence; 

} 

/** 
* Returns the UTF-8 code of a character. 
* 
* @see http://en.wikipedia.org/wiki/UTF-8#Description 
* @access protected 
* @param string $string 
* @param integer $bytes 
* @param integer $position 
* @return integer 
*/ 
protected function _ord(&$string, $bytes = null, $pos=0) 
{ 
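    if (is_null($bytes)) 
     $bytes = $this->_characterBytes($string, $pos); 

    // Require the whole sequence, starting at $pos, to lie inside the string. 
    if ($bytes !== false && strlen($string) >= $pos + $bytes) { 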

     switch ($bytes) { 
      case 1: 
       return ord($string[$pos]); 
       break; 

      case 2: 
       return ((ord($string[$pos]) & 0x1f) << 6) + 
         ((ord($string[$pos+1]) & 0x3f)); 
       break; 

      case 3: 
       return ((ord($string[$pos]) & 0xf) << 12) + 
         ((ord($string[$pos+1]) & 0x3f) << 6) + 
         ((ord($string[$pos+2]) & 0x3f)); 
       break; 

      case 4: 
       // 4-byte sequence: the lead byte is followed by three continuation bytes. 
       return ((ord($string[$pos]) & 0x7) << 18) + 
         ((ord($string[$pos+1]) & 0x3f) << 12) + 
         ((ord($string[$pos+2]) & 0x3f) << 6) + 
         ((ord($string[$pos+3]) & 0x3f)); 
       break; 

      case 0: 
      default: 
       return false; 
     } 
    } 

    return false; 
} 
/** 
* Returns the number of bytes of the $position-th character. 
* 
* @see http://en.wikipedia.org/wiki/UTF-8#Description 
* @access protected 
* @param string $string 
* @param integer $position 
*/ 
protected function _characterBytes(&$string, $position = 0) { 
    $char  = $string[$position]; 
    $charVal = ord($char); 

    if (($charVal & 0x80) === 0) 
     return 1; 

    elseif (($charVal & 0xe0) === 0xc0) 
     return 2; 

    elseif (($charVal & 0xf0) === 0xe0) 
     return 3; 

    elseif (($charVal & 0xf8) === 0xf0) 
     return 4; 
    /* 
    elseif (($charVal & 0xfe) === 0xf8) 
     return 5; 
    */ 

    return false; 
}
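For readers who only need the code point of a character that has already been split off, the bit masks used above can be condensed into a standalone function. This is a minimal sketch and not part of Zend_Locale_UTF8; the name `utf8_code_point` is hypothetical, and like the methods above it does not validate the continuation bytes.

```php
<?php
// Hypothetical helper: code point of one already-split UTF-8 character,
// or false when the byte length does not match the lead byte.
function utf8_code_point(string $char)
{
    $len = strlen($char);
    if ($len === 0) {
        return false;
    }
    $b0 = ord($char[0]);
    if ($len === 1 && $b0 < 0x80) {                 // 0xxxxxxx
        return $b0;
    }
    if ($len === 2 && ($b0 & 0xe0) === 0xc0) {      // 110xxxxx 10xxxxxx
        return (($b0 & 0x1f) << 6) | (ord($char[1]) & 0x3f);
    }
    if ($len === 3 && ($b0 & 0xf0) === 0xe0) {      // 1110xxxx + 2 continuation bytes
        return (($b0 & 0x0f) << 12)
            | ((ord($char[1]) & 0x3f) << 6)
            | (ord($char[2]) & 0x3f);
    }
    if ($len === 4 && ($b0 & 0xf8) === 0xf0) {      // 11110xxx + 3 continuation bytes
        return (($b0 & 0x07) << 18)
            | ((ord($char[1]) & 0x3f) << 12)
            | ((ord($char[2]) & 0x3f) << 6)
            | (ord($char[3]) & 0x3f);
    }
    return false;
}

printf("U+%04X\n", utf8_code_point('文'));
```

Comparing integers instead of byte strings can be convenient when the allowed set is defined as code point ranges.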

2009-09-08 21:52:45

I was able to write a solution using mb_*, including a trip to UTF-16 and back in a probably silly attempt to speed up string indexing:

$japanese2 = mb_convert_encoding($japanese, "UTF-16", "UTF-8"); 
$length = mb_strlen($japanese2, "UTF-16"); 
for($i=0; $i<$length; $i++) { 
    $char = mb_substr($japanese2, $i, 1, "UTF-16"); 
    $utf8 = mb_convert_encoding($char, "UTF-8", "UTF-16"); 
    print $utf8 . "\n"; 
}

I had better luck avoiding mb_internal_encoding and just specifying the encoding in each mb_* call. I'm sure I'll wind up using the preg solution.
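On newer PHP versions the same "pass the encoding explicitly" idea can be written more directly. The sketch below assumes PHP 7.4+ with mbstring for `mb_str_split`, and falls back to the preg approach otherwise; the function name `mb_chars` is hypothetical.

```php
<?php
// Hypothetical helper: split a UTF-8 string into characters, naming the
// encoding explicitly instead of relying on mb_internal_encoding().
function mb_chars(string $str): array
{
    if (function_exists('mb_str_split')) {
        // Available since PHP 7.4 when the mbstring extension is loaded.
        return mb_str_split($str, 1, 'UTF-8');
    }
    // Fallback: PCRE in UTF-8 mode; preg_split returns false on invalid UTF-8.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    return $chars === false ? [] : $chars;
}

print implode("\n", mb_chars("abc 文字化け")) . "\n";
```

If the input arrives in another encoding, it could first be converted with `mb_convert_encoding($str, 'UTF-8', $sourceEncoding)`, as the loop above already does for UTF-16.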

2009-09-09 01:23:09 joeforker

Slightly simpler than preg_match_all:

preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY)

This gives you back a one-dimensional array of characters. No need for a matches object.
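Tying this back to the goal stated in the question, the split could feed a subset check along these lines. This is only a sketch: the helper name `uses_only_chars` is hypothetical, both strings are assumed to be UTF-8, and no Unicode normalization is applied.

```php
<?php
// Hypothetical helper: true when every character of $input also occurs
// somewhere in $allowed.
function uses_only_chars(string $input, string $allowed): bool
{
    $in  = preg_split('//u', $input, -1, PREG_SPLIT_NO_EMPTY);
    $set = preg_split('//u', $allowed, -1, PREG_SPLIT_NO_EMPTY);
    if ($in === false || $set === false) {
        return false; // one of the strings is not valid UTF-8
    }
    // array_diff keeps the entries of $in that are missing from $set.
    return array_diff($in, $set) === [];
}

var_dump(uses_only_chars("化け", "文字化け"));
var_dump(uses_only_chars("abc", "ab"));
```

For a large allowed set, flipping it into array keys once and testing with `isset` would likely be faster than repeated `array_diff` calls.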

2015-05-26 20:33:27 mpen

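This answer makes the most sense: logically the goal is to split, and we don't care about matching each character (even if that is what happens in the background). I was about to answer this question with your solution, with one small difference: the limit (third parameter) could be NULL instead of -1, since «-1, 0 or NULL means "no limit" and, as is standard across PHP, you can use NULL [to skip to the flags parameter](http://php.net/manual/en/function.preg-split.php)». – Armfoot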
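A quick way to see that the two integer spellings of "no limit" behave the same is sketched below. Note, as an assumption about newer versions rather than something stated in this thread: from PHP 8.1 on, passing NULL to a non-nullable built-in parameter such as `$limit` raises a deprecation notice, so -1 or 0 is probably the safer choice today.

```php
<?php
$str = "文字化け";

// Both -1 and 0 mean "no limit" for preg_split's third parameter.
$a = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
$b = preg_split('//u', $str, 0, PREG_SPLIT_NO_EMPTY);

var_dump($a === $b);
```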
