2012-04-29

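Odd associative array behavior with punctuation

I have this simple code to count the punctuation in a string, i.e. "there are 2 commas, 3 semicolons..." and so on. But it doesn't work when it encounters an em-dash (—). Note that this is not a hyphen (-); I don't care about those.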

Is there something special about the em-dash that makes it behave strangely in PHP strings and/or as an array key? Maybe a weird Unicode issue?

$punc_counts = array(
    "," => 0, 
    ";" => 0, 
    "—" => 0, //exists, really! 
    "'" => 0, 
    "\"" => 0, 
    "(" => 0, 
    ")" => 0, 
); 

// $str is a long string of text 

//remove all non-punctuation chars from $str (works correctly, keeping em-dashes) 
$puncs = ""; 
foreach($punc_counts as $key => $value) 
    $puncs .= $key; 
$str = preg_replace("/[^{$puncs}]/", "", $str); 

//$str now equals something like: 
//$str == ",;'—\"—()();;,"; 

foreach(str_split($str) as $char)
{
    //if it's a punctuation char we care about, count it
    if(isset($punc_counts[$char]))
        $punc_counts[$char]++;
    else
        print($char);
}

print("<br/>"); 
print_r($punc_counts); 
print("<br/>"); 

The code above prints:

—— 
Array ([,] => 2 [;] => 3 [—] => 0 ['] => 1 ["] => 1 [(] => 2 [)] => 2) 

Answers

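str_split() is most likely the culprit: it is not multibyte-aware and splits the string into single bytes. In UTF-8 an em-dash is three bytes (E2 80 94), so none of the one-byte pieces ever matches the "—" array key. A useful comment on the PHP documentation page for str_split suggests the following: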

function str_split_unicode($str, $l = 0) {
    if ($l > 0) {
        // Split into chunks of $l characters (not bytes)
        $ret = array();
        $len = mb_strlen($str, "UTF-8");
        for ($i = 0; $i < $len; $i += $l) {
            $ret[] = mb_substr($str, $i, $l, "UTF-8");
        }
        return $ret;
    }
    // No chunk length given: split into single UTF-8 characters
    return preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY);
}
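
To tie this back to the question, here is a minimal sketch of the counting loop rewritten to iterate over characters instead of bytes. It assumes the input string is UTF-8. The function name `count_punctuation` and the sample string are made up for illustration and are not part of the original code. The em-dash is written as its UTF-8 byte sequence so the example does not depend on the file's encoding.

```php
<?php
// The em-dash (U+2014) as its three UTF-8 bytes.
$emdash = "\xE2\x80\x94";

// Why the original loop fails: str_split() and strlen() work on bytes.
echo strlen($emdash), "\n";            // 3
echo count(str_split($emdash)), "\n";  // 3 one-byte pieces, none equal to the key

// Hypothetical helper: count how often each character in $chars occurs in $str.
function count_punctuation($str, array $chars)
{
    // Start every tracked character at zero.
    $counts = array_fill_keys($chars, 0);

    // With the /u modifier the empty pattern matches between whole UTF-8
    // characters, so each element is one character, not one byte.
    $pieces = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($pieces === false) {
        // preg_split() returns false when $str is not valid UTF-8.
        return $counts;
    }

    foreach ($pieces as $char) {
        if (isset($counts[$char])) {
            $counts[$char]++;
        }
    }
    return $counts;
}

// Made-up sample text containing two em-dashes.
$sample = "a, b; c{$emdash}d, (e) {$emdash} f;";
$counts = count_punctuation($sample, array(",", ";", $emdash, "'", "\"", "(", ")"));
print_r($counts);
```

On PHP 7.4 and later, the built-in `mb_str_split($str, 1, "UTF-8")` could be used in place of the `preg_split` call. Either way, the array key and the split pieces only match if the source file and the input share the same encoding.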