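Why does Unicode have some reserved character codes?
Look at two scripts in Unicode: Kannada and Tamil. Both languages are ancient, and I don't think there is any chance of new characters being added for them.
Edit: Then why do they waste some character codes by leaving them reserved?
Why don't they put the reserved character codes at the end of each language's character set?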
1 Answer
This has to do with how the Unicode Consortium allocates its blocks, scripts, and code points. For example, Block=Tamil starts out this way:
$ unichars '\p{Block=Tamil}' | head -20
U+00B82 ◌ஂ GC=Mn SC=Tamil TAMIL SIGN ANUSVARA
U+00B83 ஃ GC=Lo SC=Tamil TAMIL SIGN VISARGA
U+00B85 அ GC=Lo SC=Tamil TAMIL LETTER A
U+00B86 ஆ GC=Lo SC=Tamil TAMIL LETTER AA
U+00B87 இ GC=Lo SC=Tamil TAMIL LETTER I
U+00B88 ஈ GC=Lo SC=Tamil TAMIL LETTER II
U+00B89 உ GC=Lo SC=Tamil TAMIL LETTER U
U+00B8A ஊ GC=Lo SC=Tamil TAMIL LETTER UU
U+00B8E எ GC=Lo SC=Tamil TAMIL LETTER E
U+00B8F ஏ GC=Lo SC=Tamil TAMIL LETTER EE
U+00B90 ஐ GC=Lo SC=Tamil TAMIL LETTER AI
U+00B92 ஒ GC=Lo SC=Tamil TAMIL LETTER O
U+00B93 ஓ GC=Lo SC=Tamil TAMIL LETTER OO
U+00B94 ஔ GC=Lo SC=Tamil TAMIL LETTER AU
U+00B95 க GC=Lo SC=Tamil TAMIL LETTER KA
U+00B99 ங GC=Lo SC=Tamil TAMIL LETTER NGA
U+00B9A ச GC=Lo SC=Tamil TAMIL LETTER CA
U+00B9C ஜ GC=Lo SC=Tamil TAMIL LETTER JA
U+00B9E ஞ GC=Lo SC=Tamil TAMIL LETTER NYA
U+00B9F ட GC=Lo SC=Tamil TAMIL LETTER TTA
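They tend to reserve contiguous runs of 4, 8, or 16 code points for characters that are all of the same "kind". Yes, there are gaps, but it is just like a filesystem: once a sector has been allocated to a file (or a block, where blocks are not subdivided into sectors), you don't hand the unused bytes to some other process, even if the file doesn't use all of its last sector. Things tend to be padded out to block boundaries anyway.
It's not as if we are at any risk of running out of code points.
Here the allocated area begins with the "signs", as the first assigned code points in the block show. The gaps probably mark a transition from one kind of character to another. If you check the properties of the first few code points in the block, you'll see that the unassigned ones still carry the right Block property: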
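The gaps can also be enumerated programmatically. The following is a minimal Python sketch, not part of the original answer. It assumes the Tamil block spans U+0B80..U+0BFF and relies on the standard `unicodedata` module, whose tables depend on the Python version in use. The helper names `unassigned_in` and `runs` are made up for illustration.

```python
import unicodedata

# Assumption: the Tamil block covers U+0B80..U+0BFF.
TAMIL_BLOCK = range(0x0B80, 0x0C00)


def unassigned_in(block):
    """Return the code points in `block` whose General_Category is Cn."""
    return [cp for cp in block if unicodedata.category(chr(cp)) == "Cn"]


def runs(code_points):
    """Group an ascending list of code points into (first, last) runs."""
    out = []
    for cp in code_points:
        if out and cp == out[-1][1] + 1:
            # Extend the current run by one code point.
            out[-1] = (out[-1][0], cp)
        else:
            # Start a new run.
            out.append((cp, cp))
    return out


gaps = runs(unassigned_in(TAMIL_BLOCK))
for first, last in gaps:
    print(f"U+{first:04X}..U+{last:04X} ({last - first + 1} unassigned)")
```

The exact list printed may differ between Unicode versions, since later versions can fill some of these slots.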
$ uniprops -a U+00B80 U+00B81 U+00B82 U+00B83 U+00B84 U+00B85
U+0B80 ‹U+0B80› \N{U+0B80}
\pC \p{Cn}
All Any InTamil C Other Cn Unassigned Zzzz Unknown
Age=Unassigned Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered
CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=Unknown LB=XX Line_Break=XX Numeric_Type=None NT=None
Numeric_Value=NaN NV=NaN Present_In=Unassigned IN=Unassigned Script=Unknown SC=Zzzz Script=Zzzz Sentence_Break=Other SB=XX
Sentence_Break=XX Word_Break=Other WB=XX Word_Break=XX
U+0B81 ‹U+0B81› \N{U+0B81}
\pC \p{Cn}
All Any InTamil C Other Cn Unassigned Zzzz Unknown
Age=Unassigned Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered
CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=Unknown LB=XX Line_Break=XX Numeric_Type=None NT=None
Numeric_Value=NaN NV=NaN Present_In=Unassigned IN=Unassigned Script=Unknown SC=Zzzz Script=Zzzz Sentence_Break=Other SB=XX
Sentence_Break=XX Word_Break=Other WB=XX Word_Break=XX
U+0B82 ‹◌ஂ› \N{TAMIL SIGN ANUSVARA}
\w \pM \p{Mn}
All Any Alnum Alpha Alphabetic Assigned InTamil Tamil Is_Tamil Case_Ignorable CI M Mn Gr_Ext Grapheme_Extend Graph GrExt ID_Continue IDC
Mark Nonspacing_Mark Print Taml Word XID_Continue XIDC X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
Age=1.1 Bidi_Class=Nonspacing_Mark BC=NSM Bidi_Class=NSM Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered
CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=EX
Grapheme_Cluster_Break=Extend GCB=EX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
JG=NoJoiningGroup Joining_Type=T Joining_Type=Transparent JT=T Line_Break=CM Line_Break=Combining_Mark LB=CM Numeric_Type=None NT=None
Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1
Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2
Present_In=6.0 IN=6.0 Script=Tamil SC=Taml Script=Taml Sentence_Break=EX Sentence_Break=Extend SB=EX Word_Break=Extend WB=Extend
U+0B83 ‹ஃ› \N{TAMIL SIGN VISARGA}
\w \pL \p{L_} \p{Lo}
All Any Alnum Alpha Alphabetic Assigned InTamil Tamil Is_Tamil L Lo Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter
L_ Other_Letter Print Taml Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
Age=1.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR
Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None
Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1
Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2
Present_In=6.0 IN=6.0 Script=Tamil SC=Taml Script=Taml Sentence_Break=LE Sentence_Break=OLetter SB=LE Word_Break=ALetter WB=LE
Word_Break=LE
U+0B84 ‹U+0B84› \N{U+0B84}
\pC \p{Cn}
All Any InTamil C Other Cn Unassigned Zzzz Unknown
Age=Unassigned Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered
CCC=NR Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=Unknown LB=XX Line_Break=XX Numeric_Type=None NT=None
Numeric_Value=NaN NV=NaN Present_In=Unassigned IN=Unassigned Script=Unknown SC=Zzzz Script=Zzzz Sentence_Break=Other SB=XX
Sentence_Break=XX Word_Break=Other WB=XX Word_Break=XX
U+0B85 ‹அ› \N{TAMIL LETTER A}
\w \pL \p{L_} \p{Lo}
All Any Alnum Alpha Alphabetic Assigned InTamil Tamil Is_Tamil L Lo Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC ID_Start IDS Letter
L_ Other_Letter Print Taml Word XID_Continue XIDC XID_Start XIDS X_POSIX_Alnum X_POSIX_Alpha X_POSIX_Graph X_POSIX_Print X_POSIX_Word
Age=1.1 Bidi_Class=L Bidi_Class=Left_To_Right BC=L Block=Tamil Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR
Canonical_Combining_Class=NR Decomposition_Type=None DT=None East_Asian_Width=Neutral Grapheme_Cluster_Break=Other GCB=XX
Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group
JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=AL Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None
Numeric_Value=NaN NV=NaN Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1
Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2
Present_In=6.0 IN=6.0 Script=Tamil SC=Taml Script=Taml Sentence_Break=LE Sentence_Break=OLetter SB=LE Word_Break=ALetter WB=LE
Word_Break=LE
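For readers without the `uniprops` tool, a rough equivalent of the check above can be sketched in Python. This is only an illustration under stated assumptions: the standard `unicodedata` module exposes the General_Category and the character name but not the Block property, so the sketch only shows which of the first six code points are unassigned (`Cn`). The function name `describe` is hypothetical.

```python
import unicodedata


def describe(cp):
    """Return (General_Category, name) for a code point.

    Unassigned code points have no name, so a placeholder is returned.
    """
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.name(ch, "<unassigned>")


# The same six code points passed to uniprops above.
for cp in range(0x0B80, 0x0B86):
    gc, name = describe(cp)
    print(f"U+{cp:04X} GC={gc} {name}")
```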
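If you look at other allocated blocks, you'll see the same sort of thing. It makes no sense to pack unrelated things into a block.
As I said, it's not as if they are going to run out of space, so I don't know what the concern is here.
By the way, you can get Unicode exploration and processing tools such as unichars, uniprops, and uninames from my Unicode Command-Line Toolchest, either individually from there or as the whole suite via the CPAN Unicode::Tussle suite.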
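To get a feel for how much room is left, the unassigned code points can be counted. This is a rough Python sketch rather than an authoritative figure: it counts everything that `unicodedata` reports as General_Category `Cn`, which also includes the small number of permanently reserved noncharacters, and the result depends on the Unicode version bundled with your Python.

```python
import sys
import unicodedata

# The Unicode code space runs from U+0000 to U+10FFFF.
TOTAL = sys.maxunicode + 1

# Count code points with General_Category Cn (unassigned, plus noncharacters).
unassigned = sum(
    1 for cp in range(TOTAL) if unicodedata.category(chr(cp)) == "Cn"
)

print(f"Unicode data version: {unicodedata.unidata_version}")
print(f"{unassigned} of {TOTAL} code points are Cn ({unassigned / TOTAL:.0%})")
```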
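I know you're curious, but do you have another reason for asking? – 2012-03-20 16:20:05
Please clarify: do you mean to ask why there are unassigned slots in these blocks? – tchrist 2012-03-20 16:21:35
@Oded I think you have misunderstood his question, because yours is a *non sequitur*. I'm not sure it's even off-topic. – tchrist 2012-03-20 16:22:35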