2012-03-20 97 views
1

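Reserved character codes in Unicode

Why does Unicode have some reserved character codes?
Look at two languages in Unicode: Kannada and Tamil. Both languages are very old, and I think there is no chance of new characters being added for them.
Edit: Then why have they wasted some character codes by making them reserved?
Why don't they put the reserved character codes at the end of each language's character set?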

+0

I know you are curious, but do you have any other reason for asking? – 2012-03-20 16:20:05

+0

Please clarify: do you mean to ask why there are unassigned slots in these blocks? – tchrist 2012-03-20 16:21:35

+2

@Oded I think you have misunderstood his question, because your question is a *non sequitur*. I am not sure it is even relevant. – tchrist 2012-03-20 16:22:35

Answers

3

This has to do with how the Unicode Consortium allocates its blocks, scripts, and code points. For example, Block=Tamil starts out like this:

$ unichars '\p{Block=Tamil}' | head -20 
U+00B82 ‭ ◌ஂ GC=Mn SC=Tamil  TAMIL SIGN ANUSVARA 
U+00B83 ‭ ஃ GC=Lo SC=Tamil  TAMIL SIGN VISARGA 
U+00B85 ‭ அ GC=Lo SC=Tamil  TAMIL LETTER A 
U+00B86 ‭ ஆ GC=Lo SC=Tamil  TAMIL LETTER AA 
U+00B87 ‭ இ GC=Lo SC=Tamil  TAMIL LETTER I 
U+00B88 ‭ ஈ GC=Lo SC=Tamil  TAMIL LETTER II 
U+00B89 ‭ உ GC=Lo SC=Tamil  TAMIL LETTER U 
U+00B8A ‭ ஊ GC=Lo SC=Tamil  TAMIL LETTER UU 
U+00B8E ‭ எ GC=Lo SC=Tamil  TAMIL LETTER E 
U+00B8F ‭ ஏ GC=Lo SC=Tamil  TAMIL LETTER EE 
U+00B90 ‭ ஐ GC=Lo SC=Tamil  TAMIL LETTER AI 
U+00B92 ‭ ஒ GC=Lo SC=Tamil  TAMIL LETTER O 
U+00B93 ‭ ஓ GC=Lo SC=Tamil  TAMIL LETTER OO 
U+00B94 ‭ ஔ GC=Lo SC=Tamil  TAMIL LETTER AU 
U+00B95 ‭ க GC=Lo SC=Tamil  TAMIL LETTER KA 
U+00B99 ‭ ங GC=Lo SC=Tamil  TAMIL LETTER NGA 
U+00B9A ‭ ச GC=Lo SC=Tamil  TAMIL LETTER CA 
U+00B9C ‭ ஜ GC=Lo SC=Tamil  TAMIL LETTER JA 
U+00B9E ‭ ஞ GC=Lo SC=Tamil  TAMIL LETTER NYA 
U+00B9F ‭ ட GC=Lo SC=Tamil  TAMIL LETTER TTA 

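They tend to reserve contiguous rows of 4, 8, or 16 code points whose characters are all of the same "kind". Yes, there are gaps, but it is just like a filesystem: once a sector (or a block, where blocks are not divided into separate sectors) has been allocated to a file, you do not hand the unused bytes to something else, even if the file does not use all of its last part. Things tend to be padded out to block boundaries anyway.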
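As a rough illustration of that row layout, here is a minimal Python sketch that prints the Tamil block as rows of 8 slots, marking assigned code points with `X` and gaps with `.`. It assumes the standard `unicodedata` module and treats General_Category `Cn` as "unassigned"; the helper names `is_assigned` and `block_rows` are made up for this example, and the exact map depends on the Unicode version bundled with your Python.

```python
import unicodedata

# Tamil block range as listed in the Unicode Blocks.txt file.
TAMIL_START, TAMIL_END = 0x0B80, 0x0BFF


def is_assigned(cp: int) -> bool:
    """True unless the code point has General_Category Cn (unassigned)."""
    return unicodedata.category(chr(cp)) != "Cn"


def block_rows(start: int, end: int, width: int = 8):
    """Yield (row_start, cells): 'X' marks an assigned slot, '.' marks a gap."""
    for row in range(start, end + 1, width):
        row_end = min(row + width, end + 1)
        cells = "".join("X" if is_assigned(cp) else "." for cp in range(row, row_end))
        yield row, cells


if __name__ == "__main__":
    for row_start, cells in block_rows(TAMIL_START, TAMIL_END):
        print(f"U+{row_start:04X}  {cells}")
```

Changing `width` to 4 or 16 makes it easier to see where runs of one kind of character start and stop.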

It is not as though we are in any danger of running out of code points.
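To get a feel for how much room is left, the following sketch counts code points by whether Python's `unicodedata` reports them as unassigned (General_Category `Cn`). This is only an approximation under stated assumptions: "assigned" here lumps in private-use and surrogate code points, `Cn` also covers the noncharacters, and the totals depend on the Unicode version your Python ships with. The function name `count_by_assignment` is hypothetical.

```python
import unicodedata
from collections import Counter

TOTAL_CODE_POINTS = 0x110000  # U+0000 through U+10FFFF


def count_by_assignment() -> Counter:
    """Count code points as 'unassigned' (GC=Cn) or 'assigned' (everything else)."""
    counts = Counter()
    for cp in range(TOTAL_CODE_POINTS):
        key = "unassigned" if unicodedata.category(chr(cp)) == "Cn" else "assigned"
        counts[key] += 1
    return counts


if __name__ == "__main__":
    counts = count_by_assignment()
    print(f"Unicode data version: {unicodedata.unidata_version}")
    for key, n in sorted(counts.items()):
        print(f"{key:>10}: {n:>9,} ({n / TOTAL_CODE_POINTS:.1%})")
```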

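The allocated area starts off with "signs", as shown by the first assigned code points in the block. A gap probably marks a transition from one kind of character to another. If you check the properties of the first six code points of the block, you can see that the unassigned ones still have the correct Block property: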

$ uniprops -a U+00B80 U+00B81 U+00B82 U+00B83 U+00B84 U+00B85 
U+0B80 ‹U+0B80› \N{U+0B80} 
    \pC \p{Cn} 
    All Any InTamil C Other Cn Unassigned Zzzz Unknown 
    Age=Unassigned Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered 
     CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX 
     Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group 
     JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=Unknown LB=XX Line_Break=XX Numeric_Type=None NT=None 
     Numeric_Value=NaN NV=NaN Present_In=Unassigned IN=Unassigned Script=Unknown SC=Zzzz Script=Zzzz Sentence_Break=Other SB=XX 
     Sentence_Break=XX Word_Break=Other WB=XX Word_Break=XX 
U+0B81 ‹U+0B81› \N{U+0B81} 
    \pC \p{Cn} 
    All Any InTamil C Other Cn Unassigned Zzzz Unknown 
    Age=Unassigned Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered 
     CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX 
     Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group 
     JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=Unknown LB=XX Line_Break=XX Numeric_Type=None NT=None 
     Numeric_Value=NaN NV=NaN Present_In=Unassigned IN=Unassigned Script=Unknown SC=Zzzz Script=Zzzz Sentence_Break=Other SB=XX 
     Sentence_Break=XX Word_Break=Other WB=XX Word_Break=XX 
U+0B82 ‹◌ஂ› \N{TAMIL SIGN ANUSVARA} 
    \w \pM \p{Mn} 
    All Any Alnum Alpha Alphabetic Assigned InTamil Tamil Is_Tamil Case_Ignorable CI M Mn Gr_Ext Grapheme_Extend Graph GrExt ID_Continue IDC 
     Mark Nonspacing_Mark Print Taml Word XID_Continue XIDC X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word 
    Age=1.1 Bidi_Class=Nonspacing_Mark BC=NSM Bidi_Class=NSM Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered 
     CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=EX 
     Grapheme_Cluster_Break=Extend GCB=EX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group 
     JG=NoJoiningGroup Joining_Type=T Joining_Type=Transparent JT=T Line_Break=CM Line_Break=Combining_Mark LB=CM Numeric_Type=None NT=None 
     Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 
     Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 
     Present_In=6.0 IN=6.0 Script=Tamil SC=Taml Script=Taml Sentence_Break=EX Sentence_Break=Extend SB=EX Word_Break=Extend WB=Extend 
U+0B83 ‹ஃ› \N{TAMIL SIGN VISARGA} 
    \w \pL \p{L_} \p{Lo} 
    All Any Alnum Alpha Alphabetic Assigned InTamil Tamil Is_Tamil L Lo Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter 
     L_ Other_Letter Print Taml Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word 
    Age=1.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR 
     Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX 
     Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group 
     JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None 
     Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 
     Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 
     Present_In=6.0 IN=6.0 Script=Tamil SC=Taml Script=Taml Sentence_Break=LE Sentence_Break=OLetter SB=LE Word_Break=ALetter WB=LE 
     Word_Break=LE 
U+0B84 ‹U+0B84› \N{U+0B84} 
    \pC \p{Cn} 
    All Any InTamil C Other Cn Unassigned Zzzz Unknown 
    Age=Unassigned Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered 
     CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX 
     Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group 
     JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=Unknown LB=XX Line_Break=XX Numeric_Type=None NT=None 
     Numeric_Value=NaN NV=NaN Present_In=Unassigned IN=Unassigned Script=Unknown SC=Zzzz Script=Zzzz Sentence_Break=Other SB=XX 
     Sentence_Break=XX Word_Break=Other WB=XX Word_Break=XX 
U+0B85 ‹அ› \N{TAMIL LETTER A} 
    \w \pL \p{L_} \p{Lo} 
    All Any Alnum Alpha Alphabetic Assigned InTamil Tamil Is_Tamil L Lo Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter 
     L_ Other_Letter Print Taml Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word 
    Age=1.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR 
     Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX 
     Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group 
     JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None 
     Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 
     Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 
     Present_In=6.0 IN=6.0 Script=Tamil SC=Taml Script=Taml Sentence_Break=LE Sentence_Break=OLetter SB=LE Word_Break=ALetter WB=LE 
     Word_Break=LE 
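The output above shows that Block is a property of a code point range, independent of whether a given slot is assigned. A minimal Python sketch of the same idea follows. Python's `unicodedata` does not expose the Block property, so the `BLOCKS` table here is a small hand-copied excerpt of block ranges and `block_of` is a hypothetical helper, not a library API; the `"No_Block"` return value in this sketch only means "outside this excerpt".

```python
import unicodedata
from bisect import bisect_right

# Hand-copied excerpt of (start, end, name) block ranges; an assumption of
# this sketch, not a complete or authoritative table.
BLOCKS = [
    (0x0B00, 0x0B7F, "Oriya"),
    (0x0B80, 0x0BFF, "Tamil"),
    (0x0C00, 0x0C7F, "Telugu"),
    (0x0C80, 0x0CFF, "Kannada"),
]
_STARTS = [start for start, _, _ in BLOCKS]


def block_of(cp: int) -> str:
    """Look up the block by range alone, ignoring whether cp is assigned."""
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0 and BLOCKS[i][0] <= cp <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return "No_Block"  # here: not covered by the excerpt above


if __name__ == "__main__":
    for cp in range(0x0B80, 0x0B86):
        ch = chr(cp)
        gc = unicodedata.category(ch)
        name = unicodedata.name(ch, "<unassigned>")
        print(f"U+{cp:04X}  Block={block_of(cp)}  GC={gc}  {name}")
```

Because the lookup is purely range-based, an unassigned slot such as U+0B84 still reports Block=Tamil while its General_Category is `Cn`.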

If you look at other allocated blocks, you will see the same sort of thing. It makes no sense to split a block up among unrelated things.

As I said, it is not as though they are going to run out of room, so I do not know what the concern here is.

By the way, you can get Unicode exploration and processing tools such as unichars, uniprops, and uninames from my Unicode Command-Line Toolchest, either individually from there or as the whole suite via the CPAN Unicode::Tussle suite.