2011-08-29 47 views
15

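R sorts character vectors in an order that I would describe as alphabetical rather than ASCII. What collation rules does R use for character vectors?

For example: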

sort(c("dog", "Cat", "Dog", "cat")) 
[1] "cat" "Cat" "dog" "Dog" 

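Three questions:

  1. What is the technically correct term for this sort order?
  2. I could not find any reference to this in the manuals on CRAN. Where can I find a description of the collation rules in R?
  3. How does this behaviour differ from that of other languages such as C, Java, Perl or PHP?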
+0

Related: [Sorting strings without ignoring case](http://stackoverflow.com/q/4245196/271616). –

Answers

21

The Details section of help(sort) states:

The sort order for character vectors will depend on the collating 
sequence of the locale in use: see ‘Comparison’. The sort order 
for factors is the order of their levels (which is particularly 
appropriate for ordered factors). 

help(Comparison) then shows:

Comparison of strings in character vectors is lexicographic within 
the strings using the collating sequence of the locale in use: see 
‘locales’. The collating sequence of locales such as ‘en_US’ is 
normally different from ‘C’ (which should use ASCII) and can be 
surprising. Beware of making _any_ assumptions about the 
collation order: e.g. in Estonian ‘Z’ comes between ‘S’ and ‘T’, 
and collation is not necessarily character-by-character - in 
Danish ‘aa’ sorts as a single letter, after ‘z’. In Welsh ‘ng’ 
may or may not be a single sorting unit: if it is it follows ‘g’. 
Some platforms may not respect the locale and always sort in 
numerical order of the bytes in an 8-bit locale, or in Unicode 
point order for a UTF-8 locale (and may not sort in the same order 
for the same language in different character sets). Collation of 
non-letters (spaces, punctuation signs, hyphens, fractions and so 
on) is even more problematic. 

So it depends on your locale settings.
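To make the locale dependence concrete, here is a minimal sketch that compares the default sort with a sort forced into C-locale order. It assumes R 3.3.0 or later, where `sort()` accepts `method = "radix"`, which is documented to sort character vectors in C-locale order regardless of the session locale. The variable names are illustrative only.

```r
x <- c("dog", "Cat", "Dog", "cat")

# Collation locale currently in effect; the value is platform-dependent
Sys.getlocale("LC_COLLATE")

# Default method for character vectors follows the locale's collating
# sequence, so the result can differ between machines
sort(x)

# method = "radix" always uses C-locale (byte) order, so uppercase
# ASCII letters come before lowercase ones
c_order <- sort(x, method = "radix")
print(c_order)
# [1] "Cat" "Dog" "cat" "dog"
```

If the first `sort(x)` call matches the output in the question, the session is probably running under a locale such as en_US rather than C.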

+1

D'oh. I was trying to find it in http://cran.r-project.org/doc/manuals/R-ints.html. Thanks. – Andrie

+3

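I won't try to improve on Dirk's description or the help page's, but outside of R you may find this described as lexicographic ordering, albeit case-insensitive. Collation rules are a serious consideration, because naive text processing is often written for English ordering, to the detriment of some other languages. A good example is sorting names, which can end up looking strange either to native speakers or to people who think only in terms of 26 letters in strict A-Z order. – Iterator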

+0

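I just spent a long time discovering that the space character may or may not be ignored, and that this depends on whether I run my tests locally or under 'R CMD check'. –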
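One plausible explanation for the difference described in this comment, offered as an assumption rather than something the comment confirms, is that the two environments run with different `LC_COLLATE` settings. The sketch below pins collation for a single expression so a test gives the same order everywhere. `with_collate` is a hypothetical helper written for this illustration, not a function from base R.

```r
# Hypothetical helper: evaluate `expr` under a given collation locale,
# then restore the previous setting even if an error occurs
with_collate <- function(locale, expr) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", locale)
  force(expr)  # lazy argument is evaluated here, after the switch
}

y <- c("a b", "ab", "a c")

# In the C locale the space (byte 32) sorts before any letter
pinned <- with_collate("C", sort(y))
print(pinned)
# [1] "a b" "a c" "ab"

# Under the session's own locale the space may be weighted differently
sort(y)
```

Comparing `pinned` with the plain `sort(y)` result on each machine shows whether the session locale treats spaces differently from the C locale.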