2009-08-11 22 views
58

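How to replace Microsoft-encoded quotes in PHP

Because of an encoding problem in my application, I need to replace the Microsoft Word versions of single and double quotes (“ ” ‘ ’) with regular quotes (' and "). I don't need them as HTML entities, and I can't change my database schema.

I have two options: use a regular expression or an associative array.

Is there a better way to do this?

Answers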

76

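Considering you only want to replace a few specific, well-identified characters, I would go with str_replace and an array: you obviously don't need the heavy artillery that regular expressions would bring ;-)

If you run into other special characters (damn copy-paste from Word...), you can add them to that array whenever necessary, as they are identified.


Edit: the best answer I can give to your comment is probably this link: Convert Smart Quotes with PHP

And the associated code (quoted from that page):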

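function convert_smart_quotes($string)
{
    // Byte values of the Word characters in Windows-1252 (CP1252).
    // This only matches single-byte CP1252 input, not UTF-8.
    $search = array(chr(145), // left single quote
        chr(146),             // right single quote
        chr(147),             // left double quote
        chr(148),             // right double quote
        chr(151));            // em dash

    $replace = array("'",
        "'",
        '"',
        '"',
        '-');

    return str_replace($search, $replace, $string);
}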

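(I don't have MS Word on this computer, so I can't test it myself.)

I don't remember exactly what we used at work (I was not the one who had to deal with that kind of input), but it was the same kind of thing...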

+0

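How would you specify the MS characters? – 2009-08-11 18:17:53

+0

That's what I was looking for. Thanks. The search array didn't work as-is; I ended up using the hex version provided in the comments of the link given above. – 2009-08-11 18:43:18

+1

OK :-) Thanks for the information! – 2009-08-11 18:46:28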

29

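The Microsoft-encoded quotes are probably the typographic quotation marks. If you know the encoding of the string in which you want to replace them, you can simply replace them with str_replace.

Here is an example for UTF-8, using a single mapping array with strtr: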

$quotes = array(
    "\xC2\xAB"  => '"', // « (U+00AB) in UTF-8 
    "\xC2\xBB"  => '"', // » (U+00BB) in UTF-8 
    "\xE2\x80\x98" => "'", // ‘ (U+2018) in UTF-8 
    "\xE2\x80\x99" => "'", // ’ (U+2019) in UTF-8 
    "\xE2\x80\x9A" => "'", // ‚ (U+201A) in UTF-8 
    "\xE2\x80\x9B" => "'", // ‛ (U+201B) in UTF-8 
    "\xE2\x80\x9C" => '"', // “ (U+201C) in UTF-8 
    "\xE2\x80\x9D" => '"', // ” (U+201D) in UTF-8 
    "\xE2\x80\x9E" => '"', // „ (U+201E) in UTF-8 
    "\xE2\x80\x9F" => '"', // ‟ (U+201F) in UTF-8 
    "\xE2\x80\xB9" => "'", // ‹ (U+2039) in UTF-8 
    "\xE2\x80\xBA" => "'", // › (U+203A) in UTF-8 
); 
$str = strtr($str, $quotes); 

If you need another encoding, you can use mb_convert_encoding to convert the keys.
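As a rough sketch of that key conversion, assuming the mbstring extension is loaded and the target encoding is Windows-1252: the helper name convert_map_keys is hypothetical, and the map is a shortened copy of $quotes. Characters that do not exist in the target encoding (U+201B here) would otherwise turn into the substitute character, so the sketch drops any key that does not survive a round trip.

```php
<?php
// Shortened copy of the UTF-8 map above.
$quotesUtf8 = array(
    "\xE2\x80\x98" => "'", // U+2018
    "\xE2\x80\x99" => "'", // U+2019
    "\xE2\x80\x9B" => "'", // U+201B, has no Windows-1252 equivalent
    "\xE2\x80\x9C" => '"', // U+201C
    "\xE2\x80\x9D" => '"', // U+201D
);

// Hypothetical helper: re-encode the keys of a replacement map.
function convert_map_keys(array $map, string $to, string $from = 'UTF-8'): array
{
    $converted = array();
    foreach ($map as $key => $value) {
        $newKey = mb_convert_encoding($key, $to, $from);
        // Skip characters the target encoding cannot represent; they would
        // otherwise become a substitute such as "?" and match real text.
        if (mb_convert_encoding($newKey, $from, $to) !== $key) {
            continue;
        }
        $converted[$newKey] = $value;
    }
    return $converted;
}

$quotesCp1252 = convert_map_keys($quotesUtf8, 'Windows-1252');

// "\x93" and "\x94" are the curly double quotes in Windows-1252.
echo strtr("\x93Hello\x94", $quotesCp1252), "\n";
```

The converted map is only meaningful for strings that really are in the target encoding; applying it to UTF-8 input would corrupt multi-byte characters.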

+0

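Instead of the ugly '\x' escapes, can't you simply include the literal characters in your source file? – 2010-10-03 04:50:59

+3

@R..: That's exactly the problem: many people don't know much about character encodings and/or which character encoding they are using. – Gumbo 2010-10-03 06:42:37

+0

Thanks very much. Love importing Excel spreadsheets into MySQL :S +1 – Drewid 2011-09-01 09:35:51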

4

We have used the following. It handles a few special characters.

$text = str_replace(chr(130), ',', $text); // baseline single quote 
$text = str_replace(chr(132), '"', $text); // baseline double quote 
$text = str_replace(chr(133), '...', $text); // ellipsis 
$text = str_replace(chr(145), "'", $text); // left single quote 
$text = str_replace(chr(146), "'", $text); // right single quote 
$text = str_replace(chr(147), '"', $text); // left double quote 
$text = str_replace(chr(148), '"', $text); // right double quote 

// Note: the 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2.
$text = mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8');
+0

Before running the replacements, you should check the encoding of the string $text. It may already be a Unicode string, in which case you are mangling it. – NobleUplift 2014-04-23 18:53:00
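The check suggested in this comment could look roughly like the following. It is only a sketch: the function name replace_smart_quotes is hypothetical, mbstring is assumed to be available, and treating everything that is not valid UTF-8 as Windows-1252 is an assumption that will not hold for every data source.

```php
<?php
// Hypothetical helper: pick the byte patterns that match the string's encoding.
function replace_smart_quotes(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        // Valid UTF-8 (plain ASCII included): match the multi-byte sequences.
        $map = array(
            "\xE2\x80\x98" => "'", // U+2018 left single quote
            "\xE2\x80\x99" => "'", // U+2019 right single quote
            "\xE2\x80\x9C" => '"', // U+201C left double quote
            "\xE2\x80\x9D" => '"', // U+201D right double quote
        );
    } else {
        // Assumption: input that is not valid UTF-8 is Windows-1252.
        $map = array(
            "\x91" => "'", // left single quote
            "\x92" => "'", // right single quote
            "\x93" => '"', // left double quote
            "\x94" => '"', // right double quote
        );
    }
    return strtr($text, $map);
}

// UTF-8 input: quotes are replaced, other multi-byte characters are untouched.
echo replace_smart_quotes("\xE2\x80\x9CHan \xE5\x97\x97\xE2\x80\x9D"), "\n";
// Windows-1252 input: the single-byte quotes are replaced.
echo replace_smart_quotes("\x93Hi\x94"), "\n";
```

Either branch leaves the rest of the string in its original encoding, so any remaining high bytes still need to be handled separately.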

77

I have found an answer to this question. You need just one line of code, using the iconv() function in PHP:

// replace Microsoft Word version of single and double quotations marks (“ ” ‘ ’) with regular quotes (' and ")
$output = iconv('UTF-8', 'ASCII//TRANSLIT', $input);  
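
What //TRANSLIT produces depends on the iconv implementation and locale of the server, and iconv() returns false when it cannot convert the input. A defensive wrapper might look like the sketch below; the name to_ascii_quotes and the fallback map are assumptions added for illustration, not part of the answer, and the fallback replaces only the four Word quotes rather than transliterating everything.

```php
<?php
// Hypothetical wrapper around the one-liner above.
function to_ascii_quotes(string $input): string
{
    // @ silences the notice iconv() raises for input it cannot convert.
    $output = @iconv('UTF-8', 'ASCII//TRANSLIT', $input);
    if ($output !== false) {
        return $output;
    }

    // Fallback (assumption): replace only the UTF-8 curly quotes and
    // leave every other character as it is.
    return strtr($input, array(
        "\xE2\x80\x98" => "'", // U+2018
        "\xE2\x80\x99" => "'", // U+2019
        "\xE2\x80\x9C" => '"', // U+201C
        "\xE2\x80\x9D" => '"', // U+201D
    ));
}

echo to_ascii_quotes("\xE2\x80\x9CHello\xE2\x80\x9D"), "\n";
```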
+1

This works great for me! – rushinge 2011-10-18 19:59:22

+0

Glad to know my answer helped you :-) – 2011-10-20 06:16:59

+1

Awesome! Thanks!!! This cleans up MS Word characters perfectly! – used2could 2012-05-11 19:30:29

10
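
If, like me, you came here with a much wider range of broken ASCII/MS Word characters, produced by a CMS or RTE that is doing weird things, and iconv isn't working, then this crazy function might be just what you need.

When you save this function to a file, make sure the file's encoding is UTF-8.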

<?php 
/** 
* fixMSWord 
* 
* Replace ascii chars with utf8. Note there are ascii characters that don't 
* correctly map and will be replaced by spaces. 
* 
* @author  Robin Cafolla 
* @date  2013-03-22 
* @Copyright (c) 2013 Robin Cafolla 
* @licence  MIT (x11) http://opensource.org/licenses/MIT 
*/ 
function fixMSWord($string) { 
    $map = Array(
     '33' => '!', '34' => '"', '35' => '#', '36' => '$', '37' => '%', '38' => '&', '39' => "'", '40' => '(', '41' => ')', '42' => '*', 
     '43' => '+', '44' => ',', '45' => '-', '46' => '.', '47' => '/', '48' => '0', '49' => '1', '50' => '2', '51' => '3', '52' => '4', 
     '53' => '5', '54' => '6', '55' => '7', '56' => '8', '57' => '9', '58' => ':', '59' => ';', '60' => '<', '61' => '=', '62' => '>', 
     '63' => '?', '64' => '@', '65' => 'A', '66' => 'B', '67' => 'C', '68' => 'D', '69' => 'E', '70' => 'F', '71' => 'G', '72' => 'H', 
     '73' => 'I', '74' => 'J', '75' => 'K', '76' => 'L', '77' => 'M', '78' => 'N', '79' => 'O', '80' => 'P', '81' => 'Q', '82' => 'R', 
     '83' => 'S', '84' => 'T', '85' => 'U', '86' => 'V', '87' => 'W', '88' => 'X', '89' => 'Y', '90' => 'Z', '91' => '[', '92' => '\\', 
     '93' => ']', '94' => '^', '95' => '_', '96' => '`', '97' => 'a', '98' => 'b', '99' => 'c', '100'=> 'd', '101'=> 'e', '102'=> 'f', 
     '103'=> 'g', '104'=> 'h', '105'=> 'i', '106'=> 'j', '107'=> 'k', '108'=> 'l', '109'=> 'm', '110'=> 'n', '111'=> 'o', '112'=> 'p', 
     '113'=> 'q', '114'=> 'r', '115'=> 's', '116'=> 't', '117'=> 'u', '118'=> 'v', '119'=> 'w', '120'=> 'x', '121'=> 'y', '122'=> 'z', 
     '123'=> '{', '124'=> '|', '125'=> '}', '126'=> '~', '127'=> ' ', '128'=> '&#8364;', '129'=> ' ', '130'=> ',', '131'=> ' ', '132'=> '"', 
     '133'=> '.', '134'=> ' ', '135'=> ' ', '136'=> '^', '137'=> ' ', '138'=> ' ', '139'=> '<', '140'=> ' ', '141'=> ' ', '142'=> ' ', 
     '143'=> ' ', '144'=> ' ', '145'=> "'", '146'=> "'", '147'=> '"', '148'=> '"', '149'=> '.', '150'=> '-', '151'=> '-', '152'=> '~', 
     '153'=> ' ', '154'=> ' ', '155'=> '>', '156'=> ' ', '157'=> ' ', '158'=> ' ', '159'=> ' ', '160'=> ' ', '161'=> '¡', '162'=> '¢', 
     '163'=> '£', '164'=> '¤', '165'=> '¥', '166'=> '¦', '167'=> '§', '168'=> '¨', '169'=> '©', '170'=> 'ª', '171'=> '«', '172'=> '¬', 
     '173'=> '­', '174'=> '®', '175'=> '¯', '176'=> '°', '177'=> '±', '178'=> '²', '179'=> '³', '180'=> '´', '181'=> 'µ', '182'=> '¶', 
     '183'=> '·', '184'=> '¸', '185'=> '¹', '186'=> 'º', '187'=> '»', '188'=> '¼', '189'=> '½', '190'=> '¾', '191'=> '¿', '192'=> 'À', 
     '193'=> 'Á', '194'=> 'Â', '195'=> 'Ã', '196'=> 'Ä', '197'=> 'Å', '198'=> 'Æ', '199'=> 'Ç', '200'=> 'È', '201'=> 'É', '202'=> 'Ê', 
     '203'=> 'Ë', '204'=> 'Ì', '205'=> 'Í', '206'=> 'Î', '207'=> 'Ï', '208'=> 'Ð', '209'=> 'Ñ', '210'=> 'Ò', '211'=> 'Ó', '212'=> 'Ô', 
     '213'=> 'Õ', '214'=> 'Ö', '215'=> '×', '216'=> 'Ø', '217'=> 'Ù', '218'=> 'Ú', '219'=> 'Û', '220'=> 'Ü', '221'=> 'Ý', '222'=> 'Þ', 
     '223'=> 'ß', '224'=> 'à', '225'=> 'á', '226'=> 'â', '227'=> 'ã', '228'=> 'ä', '229'=> 'å', '230'=> 'æ', '231'=> 'ç', '232'=> 'è', 
     '233'=> 'é', '234'=> 'ê', '235'=> 'ë', '236'=> 'ì', '237'=> 'í', '238'=> 'î', '239'=> 'ï', '240'=> 'ð', '241'=> 'ñ', '242'=> 'ò', 
     '243'=> 'ó', '244'=> 'ô', '245'=> 'õ', '246'=> 'ö', '247'=> '÷', '248'=> 'ø', '249'=> 'ù', '250'=> 'ú', '251'=> 'û', '252'=> 'ü', 
     '253'=> 'ý', '254'=> 'þ', '255'=> 'ÿ' 
    ); 

    $search = Array(); 
    $replace = Array(); 

    foreach ($map as $s => $r) { 
     $search[] = chr((int)$s); 
     $replace[] = $r; 
    } 

    return str_replace($search, $replace, $string); 
} 
+0

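Can I use this in a project, since it is under the MIT license? – 2013-04-11 12:24:08

+0

Generally speaking, the MIT license lets you use it in any way you like, as long as you don't remove the license :) – thelastshadow 2013-04-11 12:47:17

+3

You decided to put a license on what basically amounts to... an array? – JMTyler 2014-02-06 23:42:03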

3

Every one of the previous answers except @Gumbo's will mangle Unicode strings:

echo convert_smart_quotes("This is Yi: ꑑ. Point ⒒ this breaks Yi. Yi broke–why? I need a longer––point. This makes Han 嗗 mad."); 

Result:

This is Yi: ?''. Point ?'' this breaks Yi. Yi broke?"why? I need a longer?"?"point. This makes Han ?-- mad. 

And with iconv:

$output = iconv('UTF-8', 'ASCII//TRANSLIT', $input); 

results in:

PHP Notice: iconv(): Detected an illegal character in input string in php shell code on line 1 

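You can change it to //IGNORE, which removes the characters instead of transliterating them.

The function below is the best way to replace Microsoft quotes that are encoded in CP1252. (If they are in Unicode and need to be replaced, use Gumbo's answer.)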

function convert_cp1252_to_ascii($input, $default = '') { 
    if ($input === null || $input == '') { 
     return $default; 
    } 

    // https://en.wikipedia.org/wiki/UTF-8 
    // https://en.wikipedia.org/wiki/ISO/IEC_8859-1 
    // https://en.wikipedia.org/wiki/Windows-1252 
    // http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT 
    $encoding = mb_detect_encoding($input, array('Windows-1252', 'ISO-8859-1'), true); 
    if ($encoding == 'ISO-8859-1' || $encoding == 'Windows-1252') { 
     /* 
     * Use the search/replace arrays if a character needs to be replaced with 
     * something other than its Unicode equivalent. 
     */ 

     $replace = array(
      128 => "E",  // http://www.fileformat.info/info/unicode/char/20AC/index.htm EURO SIGN 
      129 => "",    // UNDEFINED 
      130 => ",",  // http://www.fileformat.info/info/unicode/char/201A/index.htm SINGLE LOW-9 QUOTATION MARK 
      131 => "f",  // http://www.fileformat.info/info/unicode/char/0192/index.htm LATIN SMALL LETTER F WITH HOOK 
      132 => ",,",  // http://www.fileformat.info/info/unicode/char/201e/index.htm DOUBLE LOW-9 QUOTATION MARK 
      133 => "...",  // http://www.fileformat.info/info/unicode/char/2026/index.htm HORIZONTAL ELLIPSIS 
      134 => "t",  // http://www.fileformat.info/info/unicode/char/2020/index.htm DAGGER 
      135 => "T",  // http://www.fileformat.info/info/unicode/char/2021/index.htm DOUBLE DAGGER 
      136 => "^",  // http://www.fileformat.info/info/unicode/char/02c6/index.htm MODIFIER LETTER CIRCUMFLEX ACCENT 
      137 => "%",  // http://www.fileformat.info/info/unicode/char/2030/index.htm PER MILLE SIGN 
      138 => "S",  // http://www.fileformat.info/info/unicode/char/0160/index.htm LATIN CAPITAL LETTER S WITH CARON 
      139 => "<",  // http://www.fileformat.info/info/unicode/char/2039/index.htm SINGLE LEFT-POINTING ANGLE QUOTATION MARK 
      140 => "OE",  // http://www.fileformat.info/info/unicode/char/0152/index.htm LATIN CAPITAL LIGATURE OE 
      141 => "",    // UNDEFINED 
      142 => "Z",  // http://www.fileformat.info/info/unicode/char/017d/index.htm LATIN CAPITAL LETTER Z WITH CARON 
      143 => "",    // UNDEFINED 
      144 => "",    // UNDEFINED 
      145 => "'",  // http://www.fileformat.info/info/unicode/char/2018/index.htm LEFT SINGLE QUOTATION MARK 
      146 => "'",  // http://www.fileformat.info/info/unicode/char/2019/index.htm RIGHT SINGLE QUOTATION MARK 
      147 => "\"",  // http://www.fileformat.info/info/unicode/char/201c/index.htm LEFT DOUBLE QUOTATION MARK 
      148 => "\"",  // http://www.fileformat.info/info/unicode/char/201d/index.htm RIGHT DOUBLE QUOTATION MARK 
      149 => "*",  // http://www.fileformat.info/info/unicode/char/2022/index.htm BULLET 
      150 => "-",  // http://www.fileformat.info/info/unicode/char/2013/index.htm EN DASH 
      151 => "--",  // http://www.fileformat.info/info/unicode/char/2014/index.htm EM DASH 
      152 => "~",  // http://www.fileformat.info/info/unicode/char/02DC/index.htm SMALL TILDE 
      153 => "TM",  // http://www.fileformat.info/info/unicode/char/2122/index.htm TRADE MARK SIGN 
      154 => "s",  // http://www.fileformat.info/info/unicode/char/0161/index.htm LATIN SMALL LETTER S WITH CARON 
      155 => ">",  // http://www.fileformat.info/info/unicode/char/203A/index.htm SINGLE RIGHT-POINTING ANGLE QUOTATION MARK 
      156 => "oe",  // http://www.fileformat.info/info/unicode/char/0153/index.htm LATIN SMALL LIGATURE OE 
      157 => "",    // UNDEFINED 
      158 => "z",  // http://www.fileformat.info/info/unicode/char/017E/index.htm LATIN SMALL LETTER Z WITH CARON 
      159 => "Y",  // http://www.fileformat.info/info/unicode/char/0178/index.htm LATIN CAPITAL LETTER Y WITH DIAERESIS 
     ); 

     $find = array(); 
     foreach (array_keys($replace) as $key) { 
      $find[] = chr($key); 
     } 

     $input = str_replace($find, array_values($replace), $input); 
     /* 
     * Because ISO-8859-1 and CP1252 are identical except for 0x80 through 0x9F 
     * and control characters, always convert from Windows-1252 to UTF-8. 
     */ 
     $input = iconv('Windows-1252', 'UTF-8//IGNORE', $input); 
    } 
    return $input; 
} 

Adapted from this answer, with some modifications. Use this function if you want control over what is found and replaced.