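R sorts character vectors in an order that I would describe as alphabetical rather than ASCII. What collation rules does R use for character vectors?
For example: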
sort(c("dog", "Cat", "Dog", "cat"))
[1] "cat" "Cat" "dog" "Dog"
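Three questions:
- What is the technically correct term for this sort order?
- I can't find any reference to this in the manuals on CRAN. Where can I find a description of the collation rules in R?
- How does this differ from the behaviour of other languages such as C, Java, Perl or PHP?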
The Details section of the help for sort() states:
The sort order for character vectors will depend on the collating sequence of the locale in use: see ‘Comparison’. The sort order for factors is the order of their levels (which is particularly appropriate for ordered factors).
and help(Comparison) then shows:
Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see ‘locales’. The collating sequence of locales such as ‘en_US’ is normally different from ‘C’ (which should use ASCII) and can be surprising. Beware of making _any_ assumptions about the collation order: e.g. in Estonian ‘Z’ comes between ‘S’ and ‘T’, and collation is not necessarily character-by-character - in Danish ‘aa’ sorts as a single letter, after ‘z’. In Welsh ‘ng’ may or may not be a single sorting unit: if it is it follows ‘g’. Some platforms may not respect the locale and always sort in numerical order of the bytes in an 8-bit locale, or in Unicode point order for a UTF-8 locale (and may not sort in the same order for the same language in different character sets). Collation of non-letters (spaces, punctuation signs, hyphens, fractions and so on) is even more problematic.
So it depends on your locale settings.
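To see the locale dependence for yourself, here is a minimal sketch that reuses the vector from the question. It relies on two things as I understand them from the base R help pages: `sort(x, method = "radix")` collates character vectors as in the C locale whatever the session locale is, and `Sys.setlocale("LC_COLLATE", "C")` switches the collation of the session itself. The variable names are only illustrative.

```r
x <- c("dog", "Cat", "Dog", "cat")

# Which collating sequence is this session using?
Sys.getlocale("LC_COLLATE")

# Default sort: the result depends on the locale reported above.
sort(x)

# Radix sort collates as in the C locale, independent of the session locale.
radix_sorted <- sort(x, method = "radix")
print(radix_sorted)
# [1] "Cat" "Dog" "cat" "dog"

# Temporarily switch the session to C collation, sort, then restore.
old_collate <- Sys.getlocale("LC_COLLATE")
Sys.setlocale("LC_COLLATE", "C")
c_sorted <- sort(x)
Sys.setlocale("LC_COLLATE", old_collate)
print(c_sorted)
# [1] "Cat" "Dog" "cat" "dog"
```

If you need the same order on every machine, pinning the collation in one of these ways is probably safer than relying on whatever locale the session happens to start in.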
D'oh. I was trying to find it in http://cran.r-project.org/doc/manuals/R-ints.html. Thanks. – Andrie
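I won't try to improve on Dirk's answer or the help text, but outside R you may find this described as lexicographic ordering, albeit case-insensitive. Collation rules are a serious consideration, because naive text processing is usually written for English ordering, which does a disservice to some other languages. A good example is name sorting, which can then look odd either to native speakers *or* to people who think only in terms of 26 letters in strict A-Z order. – Iterator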
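I just spent a long time discovering that space characters may or may not be ignored, and that this depends on whether I run my tests locally or under `R CMD check`.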
Related: [Sort strings without ignoring case](http://stackoverflow.com/q/4245196/271616).