2012-04-13 91 views
18

Possible duplicate:
Why UTF-32 exists whereas only 21 bits are necessary to encode every character?

Why is there no UTF-24?

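The highest Unicode code point is 0x10FFFF. In UTF-32, each code unit therefore carries 21 bits of information and 11 superfluous blank bits. So why is there no UTF-24 encoding (i.e. UTF-32 with the high byte removed) that stores each code point in 3 bytes instead of 4?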
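To make the idea concrete, here is a minimal sketch of what "UTF-32 with the high byte removed" could look like. No such encoding is standardized; the names `utf24_encode` and `utf24_decode` are hypothetical, and the byte layout (plain 3-byte big-endian or little-endian integers) is an assumption taken from the wording of the question, not from any actual proposal.

```python
# Hypothetical "UTF-24": each code point stored as a 3-byte integer.
# This is NOT a standard encoding; names and layout are assumptions.

def utf24_encode(text: str, byteorder: str = "big") -> bytes:
    """Pack every code point of `text` into exactly 3 bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            # Surrogates are not Unicode scalar values.
            raise ValueError(f"cannot encode surrogate {cp:#x}")
        out += cp.to_bytes(3, byteorder)
    return bytes(out)


def utf24_decode(data: bytes, byteorder: str = "big") -> str:
    """Inverse of utf24_encode; rejects values outside the Unicode range."""
    if len(data) % 3:
        raise ValueError("length is not a multiple of 3")
    chars = []
    for i in range(0, len(data), 3):
        cp = int.from_bytes(data[i:i + 3], byteorder)
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"invalid code point {cp:#x} at byte offset {i}")
        chars.append(chr(cp))
    return "".join(chars)


if __name__ == "__main__":
    sample = "A\U00010330"  # ASCII letter plus a Gothic letter
    packed = utf24_encode(sample)
    print(packed.hex(" "), len(packed))
    print(utf24_decode(packed) == sample)
```

With the big-endian layout, the result is the same byte stream as Python's `utf-32-be` codec with the first byte of every 4-byte unit dropped, which is the construction the question describes.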

+0

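The reason is simple: there is no 24-bit structure. There are 16, 32, 8 and 2, so 24 is the odd one out. On top of that, memory usually comes in blocks of 2^x... It is the same reason computers are not sold with 24 MB of RAM, and hard drives are labeled 16 GB and 32 GB rather than 24 GB. – 2012-04-13 15:42:25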

Answers

19

Well, the truth is that UTF-24 was suggested in 2007:

http://unicode.org/mail-arch/unicode-ml/y2007-m01/0057.html

The pros and cons mentioned there are:

"UTF-24 
Advantages: 
1. Fixed length code units. 
2. Encoding format is easily detectable for any content, even if mislabeled. 
3. Byte order can be reliably detected without the use of BOM, even for single-code-unit data. 
4. If octets are dropped/inserted, decoder can resync at next valid code unit. 
5. Practical for both internal processing and storage/interchange. 
6. Conversion to code point scalar values is more trivial then for UTF-16 surrogate pairs 
    and UTF-7/8 multibyte sequences. 
7. 7-bit transparent version can be easily derived. 
8. Most compact for texts in archaic scripts. 
Disadvantages: 
1. Takes more space then UTF-8/16, except for texts in archaic scripts. 
2. Comparing to UTF-32, extra bitwise operations required to convert to code point scalar values. 
3. Incompatible with many legacy text-processing tools and protocols. " 

As pointed out by David Starner in http://www.mail-archive.com/[email protected]/msg16011.html:

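Why? UTF-24 will almost invariably be larger than UTF-16, unless you are talking about a document in Old Italic or Gothic. The mathematical alphanumeric characters will almost always be combined with enough ASCII to make UTF-8 the winner, and if not, with enough BMP characters to make UTF-16 the winner. Modern computers don't deal with 24-bit chunks well; in memory they would take up 32 bits apiece unless you declared them packed, and then they would be a lot slower than UTF-16 or UTF-32. And if you are storing to disk, you may as well use BOCU or SCSU (you are already going non-standard), or use standard compression with UTF-8, UTF-16, BOCU or SCSU. SCSU or BOCU, compressed, should take up half the space of UTF-24, if that.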
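The size argument in this quote can be checked with a short sketch. It uses Python's built-in `utf-8`, `utf-16-le` and `utf-32-le` codecs and, as an assumption, counts a hypothetical UTF-24 as exactly 3 bytes per code point; the helper name `encoded_sizes` and the sample strings are illustrative only.

```python
def encoded_sizes(text: str) -> dict:
    """Byte counts for one string under each encoding form (no BOM)."""
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-le")),
        # Assumption: a hypothetical UTF-24 spends 3 bytes per code point.
        "UTF-24 (hypothetical)": 3 * len(text),
        "UTF-32": len(text.encode("utf-32-le")),
    }


samples = {
    "ASCII": "Hello, world",
    "BMP (Greek)": "\u0393\u03b5\u03b9\u03ac \u03c3\u03bf\u03c5",
    # Three letters from the Gothic block, all outside the BMP.
    "Gothic": "\U00010332\U0001033F\U00010344",
    # Math alphanumerics mixed with ASCII, as in the quote.
    "Math + ASCII": "\U0001D465 = \U0001D466 + 1",
}

if __name__ == "__main__":
    for label, text in samples.items():
        print(label, encoded_sizes(text))
```

Only the pure Gothic sample favors the 3-byte form (9 bytes against 12 for each of the others); as soon as ASCII or BMP characters are mixed in, UTF-8 or UTF-16 comes out smaller, which is the point the quote makes.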

You may also want to look at the following Stack Overflow post:

Why UTF-32 exists whereas only 21 bits are necessary to encode every character?

+2

The second quote is actually from a few years earlier, 2003, where I was responding to my suggestion. – 2012-07-16 15:06:17

+0

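Personally, I would like to see them expand it. I know there is a lot of space left, but they have not finished any of the languages yet, especially the historical ones. – Rahly 2015-12-24 00:11:07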