What are R's sorting rules for character vectors?

R sorts character vectors in an order that I would describe as alphabetical rather than ASCII.

For example:

sort(c("dog", "Cat", "Dog", "cat")) 
[1] "cat" "Cat" "dog" "Dog"

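Three questions:

What is the technically correct term for this sort order?
I could not find any reference to this in the manuals on CRAN. Where can I find a description of the sorting rules in R?
How does this differ from the corresponding behaviour in other languages such as C, Java, Perl or PHP?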

Source

2011-08-29 Andrie

Related: [Do not ignore case in sorting character strings](http://stackoverflow.com/q/4245196/271616).

The Details section of the help for sort() states:

The sort order for character vectors will depend on the collating 
sequence of the locale in use: see ‘Comparison’. The sort order 
for factors is the order of their levels (which is particularly 
appropriate for ordered factors).

and help(Comparison) then shows:

Comparison of strings in character vectors is lexicographic within 
the strings using the collating sequence of the locale in use: see 
‘locales’. The collating sequence of locales such as ‘en_US’ is 
normally different from ‘C’ (which should use ASCII) and can be 
surprising. Beware of making _any_ assumptions about the 
collation order: e.g. in Estonian ‘Z’ comes between ‘S’ and ‘T’, 
and collation is not necessarily character-by-character - in 
Danish ‘aa’ sorts as a single letter, after ‘z’. In Welsh ‘ng’ 
may or may not be a single sorting unit: if it is it follows ‘g’. 
Some platforms may not respect the locale and always sort in 
numerical order of the bytes in an 8-bit locale, or in Unicode 
point order for a UTF-8 locale (and may not sort in the same order 
for the same language in different character sets). Collation of 
non-letters (spaces, punctuation signs, hyphens, fractions and so 
on) is even more problematic.

So it depends on your locale settings.
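A minimal sketch of how one might check this in a session. Sys.getlocale() and the method argument of sort() are base R; the "radix" method is documented as using C-locale ordering for character vectors (this assumes R 3.3.0 or later). The variable names x and c_sorted are only illustrative.

```r
x <- c("dog", "Cat", "Dog", "cat")

# Which collating sequence is this session using? (the value varies by platform)
Sys.getlocale("LC_COLLATE")

# Locale-dependent result: this is the call from the question
sort(x)

# Byte-order result regardless of locale: radix sort uses C-locale ordering
c_sorted <- sort(x, method = "radix")
c_sorted
# [1] "Cat" "Dog" "cat" "dog"
```

Comparing the two results shows whether the current locale collates differently from plain ASCII order.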

Source

2011-08-29 11:24:47

D'oh. I was trying to find it in http://cran.r-project.org/doc/manuals/R-ints.html. Thanks. – Andrie

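I won't try to improve on Dirk's description or the help text, but outside of R you may find this described as lexicographic ordering, albeit case-insensitive. Collation rules are a serious consideration, because naive text processing is usually built around English ordering, which does a disservice to some other languages. A good example is that it makes sorted lists of names look strange to native speakers, or to people who think only in terms of 26 letters in strict A-Z order. – Iterator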

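I just spent a long time discovering that space characters may or may not be ignored, and that this depends on whether I run the tests locally or under 'R CMD check'.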
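One way to guard against this kind of discrepancy is to pin the collation explicitly wherever the order matters. The sketch below is an illustration under assumptions, not something taken from the thread: sort_c is a hypothetical helper that temporarily switches LC_COLLATE to "C" with Sys.setlocale() and restores the previous setting on exit.

```r
# Hypothetical helper: sort under the "C" collation, then restore the
# caller's LC_COLLATE setting, so the result does not depend on the
# environment the code happens to run in.
sort_c <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  sort(x)
}

# In the "C" collation a space sorts before any letter
sort_c(c("a b", "ab", "a c"))
# [1] "a b" "a c" "ab"
```

For plain character vectors, sort(x, method = "radix") gives the same byte-order result without touching the locale; the helper form is shown because the same pattern also covers order() and string comparisons.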
