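Should I use Python casefold?

I have recently been reading about casefold and about comparing strings while ignoring case. I have read that the MSDN standard is to use InvariantCulture and to definitely avoid lowercasing. However, from what I have read, casefold is like a more aggressive lowercase. My question is: should I use casefold in Python, or is there a more pythonic standard to use instead? Also, does casefold pass the Turkey Test?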


2016-10-31 FlyingLightning

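1. What casefold does is explained [in the docs](https://docs.python.org/3/library/stdtypes.html#str.casefold). 2. What does "better" mean in this context? 3. What is the Turkey Test (and have you tried running it to find out)? – jonrsharpe

@jonrsharpe Sorry, I meant more pythonic, and I also meant the Turkey Test. I just want to know what good programmers use when they want to do a caseless comparison in Python. – FlyingLightning

@jonrsharpe - The Turkey Test is described in more detail here: http://stackoverflow.com/a/797043/135978 –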

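1) In Python 3, casefold() should be used to implement caseless string matching.

Since Python 3.0, strings are stored as Unicode. The Unicode Standard, Chapter 3.13, defines default caseless matching as follows:

A string X is a caseless match for a string Y if and only if:
toCasefold(X) = toCasefold(Y)

Python's casefold() implements Unicode's toCasefold(), so it should be used to implement caseless string matching. However, casefolding alone is not enough to cover some corner cases or to pass the Turkey Test (see Point 3).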
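To illustrate how casefold() differs from lower(), here is a minimal sketch using the German ß, which full case folding maps to "ss". The sample words are my own choice and are not part of the answer above.

```python
# Illustrative words only; they are not from the original answer.
upper_word = "STRASSE"
mixed_word = "Straße"

# lower() leaves ß unchanged, so the lowercased strings still differ.
lower_equal = upper_word.lower() == mixed_word.lower()

# casefold() maps ß to "ss", so the folded strings compare equal.
casefold_equal = upper_word.casefold() == mixed_word.casefold()

print(lower_equal, casefold_equal)
```

This is the sense in which casefold() is "more aggressive" than lower(): it exists for comparison, not for display.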

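2) As of Python 3.6, casefold() does not pass the Turkey Test.

For two characters, uppercase I and dotted uppercase İ, the Unicode Standard defines two different casefolding mappings.

The default (for non-Turkic languages):
I → i (U+0049 → U+0069)
İ → i̇ (U+0130 → U+0069 U+0307)

The alternative (for Turkic languages):
I → ı (U+0049 → U+0131)
İ → i (U+0130 → U+0069)

Python's casefold() can only apply the default mapping and so fails the Turkey Test. For example, the Turkish words "LİMANI" and "limanı" are caseless equivalents, but "LİMANI".casefold() == "limanı".casefold() returns False. There is no option to enable the alternative mapping.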
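The default mapping can be inspected directly by printing code points. This is only a sketch; code_points() is a hypothetical helper introduced here for illustration, and the strings use escapes so that the exact characters are unambiguous.

```python
def code_points(string):
    # Hypothetical helper: render each character as U+XXXX.
    return ["U+%04X" % ord(ch) for ch in string]

# Default (non-Turkic) folding as applied by str.casefold().
folded_capital_i = code_points("I".casefold())       # U+0049
folded_dotted_i = code_points("\u0130".casefold())   # U+0130, capital İ
folded_dotless_i = code_points("\u0131".casefold())  # U+0131, small ı

print(folded_capital_i, folded_dotted_i, folded_dotless_i)

# "LİMANI" versus "limanı": the folded forms differ under the default mapping.
turkish_equal = "L\u0130MANI".casefold() == "liman\u0131".casefold()
print(turkish_equal)
```

The capital İ folds to two code points (i followed by a combining dot above), while plain I folds to i rather than ı, which is why the two Turkish spellings do not meet.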

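3) How to do case-insensitive string matching in Python 3

The Unicode Standard, Chapter 3.13, describes several caseless matching algorithms. Canonical caseless matching will probably suit most use cases. This algorithm already takes all the corner cases into account. We only need to add an option to switch between non-Turkic and Turkic casefolding.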

import unicodedata

def normalize_NFD(string):
    # Canonical decomposition (NFD), as used by canonical caseless matching.
    return unicodedata.normalize('NFD', string)

def casefold_(string, include_special_i=False):
    # Wrapper around str.casefold() with an optional Turkic mapping.
    if include_special_i:
        # Recompose first so that I + U+0307 becomes the single character İ.
        string = unicodedata.normalize('NFC', string)
        string = string.replace('\u0049', '\u0131')  # I -> ı
        string = string.replace('\u0130', '\u0069')  # İ -> i
    return string.casefold()

def casefold_NFD(string, include_special_i=False):
    # NFD(toCasefold(NFD(X)))
    return normalize_NFD(casefold_(normalize_NFD(string), include_special_i))

def caseless_match(string1, string2, include_special_i=False):
    return casefold_NFD(string1, include_special_i) == casefold_NFD(string2, include_special_i)

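casefold_() is a wrapper for Python's casefold(). If its parameter include_special_i is set to True, it applies the Turkic mapping; if it is set to False, the default mapping is used.

caseless_match() does canonical caseless matching for string1 and string2. If the strings are Turkic words, the include_special_i parameter must be set to True.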

Examples:

caseless_match('LİMANI', 'limanı', include_special_i=True) True

caseless_match('LİMANI', 'limanı') False

caseless_match('INTENSIVE', 'intensive', include_special_i=True) False

caseless_match('INTENSIVE', 'intensive') True
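The NFD steps in caseless_match() matter even outside Turkic text. As a rough sketch, the same word written with precomposed letters and with combining marks does not match under casefold() alone, but does once both sides are normalized. The sample word and the helper name canonical_key() are my own assumptions for illustration; the helper leaves out the Turkic option.

```python
import unicodedata

def canonical_key(string):
    # Hypothetical helper: NFD(toCasefold(NFD(X))), default mapping only.
    decomposed_input = unicodedata.normalize("NFD", string)
    return unicodedata.normalize("NFD", decomposed_input.casefold())

composed = "\u00c5ngstr\u00f6m"       # Å and ö as single code points
decomposed = "A\u030angstro\u0308m"   # base letters plus combining marks

# Folding alone keeps the two encodings apart.
plain_equal = composed.casefold() == decomposed.casefold()

# Normalizing before and after folding brings them together.
canonical_equal = canonical_key(composed) == canonical_key(decomposed)

print(plain_equal, canonical_equal)
```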


2016-12-24 18:51:03 SergiyKolesnikov
