Q
Edit distance with accents
1
A
Answers
5
In [1]: import unicodedata, string
In [2]: from Levenshtein import distance
In [3]: def remove_accents(data):
   ...:     return ''.join(x for x in unicodedata.normalize('NFKD', data)
   ...:                    if x in string.ascii_letters).lower()
In [4]: def norm_dist(s1, s2):
   ...:     norm1, norm2 = remove_accents(s1), remove_accents(s2)
   ...:     d1, d2 = distance(s1, s2), distance(norm1, norm2)
   ...:     return (d1+d2)/2.
In [5]: norm_dist(u'ab', u'ac')
Out[5]: 1.0
In [6]: norm_dist(u'àb', u'ab')
Out[6]: 0.5
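The session above relies on the third-party Levenshtein package, and its remove_accents keeps only ASCII letters, so digits, spaces, and punctuation are silently dropped before the second distance is computed. Below is a minimal dependency-free sketch of the same averaging idea. The helper names `strip_accents` and `levenshtein` are illustrative and not part of the answer, and the NFC normalization of the raw strings is an added assumption so that precomposed and decomposed inputs compare equal.

```python
import unicodedata

def strip_accents(s):
    """Drop combining marks after NFKD decomposition; keep all other characters."""
    decomposed = unicodedata.normalize('NFKD', s)
    return ''.join(ch for ch in decomposed
                   if not unicodedata.combining(ch)).lower()

def levenshtein(a, b):
    """Plain dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def norm_dist(s1, s2):
    """Average of the raw distance and the accent-insensitive distance."""
    # Assumption: compare the raw strings in NFC so that equivalent
    # encodings of the same accented letter are not counted as edits.
    raw = levenshtein(unicodedata.normalize('NFC', s1),
                      unicodedata.normalize('NFC', s2))
    folded = levenshtein(strip_accents(s1), strip_accents(s2))
    return (raw + folded) / 2.0
```

With this sketch, `norm_dist('ab', 'ac')` gives 1.0 and `norm_dist('\u00e0b', 'ab')` gives 0.5, matching the session above, while an accent-only difference never collapses to 0.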
2
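Unicode allows accented characters to be decomposed into a base character plus a combining accent character; for example, à decomposes into a followed by a combining grave accent.
Convert both strings with the normalization form NFKD, which decomposes accented characters and converts compatibility characters to their canonical forms, then use an edit distance metric that ranks substitutions above insertions and deletions.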
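As a rough illustration of this approach, the sketch below decomposes both strings with NFKD and runs a weighted Levenshtein distance over the result. The weighting is an assumption made for the example, not something the answer specifies: an edit that only touches a combining mark costs `mark_cost` (0.5 by default), and every other edit costs 1. The names `decompose` and `weighted_distance` are hypothetical.

```python
import unicodedata

def decompose(s):
    """Return the NFKD form of s: base characters followed by combining marks."""
    return unicodedata.normalize('NFKD', s)

def weighted_distance(s1, s2, mark_cost=0.5):
    """Edit distance over NFKD forms where accent-only edits are cheaper.

    Assumed weighting (illustrative): inserting, deleting, or swapping a
    combining mark costs mark_cost; any other edit costs 1.
    """
    a, b = decompose(s1), decompose(s2)

    def indel(ch):
        # combining marks have a non-zero canonical combining class
        return mark_cost if unicodedata.combining(ch) else 1.0

    def subst(ca, cb):
        if ca == cb:
            return 0.0
        both_marks = unicodedata.combining(ca) and unicodedata.combining(cb)
        return mark_cost if both_marks else 1.0

    # prev[j] is the cost of turning the processed prefix of a into b[:j]
    prev = [0.0]
    for cb in b:
        prev.append(prev[-1] + indel(cb))
    for ca in a:
        cur = [prev[0] + indel(ca)]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + indel(ca),           # delete ca
                           cur[j - 1] + indel(cb),        # insert cb
                           prev[j - 1] + subst(ca, cb)))  # match or substitute
        prev = cur
    return prev[-1]
```

Under these assumed weights, `weighted_distance('\u00e0b', 'ab')` is 0.5, because only the combining grave accent has to be removed, while `weighted_distance('ab', 'ac')` is 1.0.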
1
Here is a solution based on difflib and unicodedata, with no dependencies:
import unicodedata
from difflib import Differ

# function taken from https://stackoverflow.com/a/517974/1222951
def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    only_ascii = nfkd_form.encode('ASCII', 'ignore').decode()
    return only_ascii

def compare(wrong, right):
    # normalize both strings to make sure equivalent (but
    # different) unicode characters are canonicalized
    wrong = unicodedata.normalize('NFKC', wrong)
    right = unicodedata.normalize('NFKC', right)

    num_diffs = 0
    index = 0
    differences = list(Differ().compare(wrong, right))
    while True:
        try:
            diff = differences[index]
        except IndexError:
            break

        # diff is a string like "+ a" (meaning the character "a" was inserted)
        # extract the operation and the character
        op = diff[0]
        char = diff[-1]

        # if the character isn't equal in both
        # strings, increase the difference counter
        if op != ' ':
            num_diffs += 1

        # if a character is wrong, there will be two operations: one
        # "+" and one "-" operation
        # we want to count this as a single mistake, not as two mistakes
        if op in '+-':
            try:
                next_diff = differences[index+1]
            except IndexError:
                pass
            else:
                next_op = next_diff[0]
                if next_op in '+-' and next_op != op:
                    # skip the next operation, we don't want to count
                    # it as another mistake
                    index += 1

                    # we know that the character is wrong, but
                    # how wrong is it?
                    # if the only difference is the accent, it's
                    # a minor mistake
                    next_char = next_diff[-1]
                    if remove_accents(char) == remove_accents(next_char):
                        num_diffs -= 0.5

        index += 1

    # output the difference as a ratio of
    # (# of wrong characters)/(length of longest input string)
    return num_diffs/max(len(wrong), len(right))
Test:
for w, r in (('ab','ac'),
             ('àb','ab'),
             ('être','etre'),
             ('très','trés'),
            ):
    print('"{}" and "{}": {}% difference'.format(w, r, compare(w, r)*100))
"ab" and "ac": 50.0% difference
"àb" and "ab": 25.0% difference
"être" and "etre": 12.5% difference
"très" and "trés": 12.5% difference
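Wouldn't it work to just replace the accented letters with unaccented ones in both strings and then compute the distance? – Dogbert
I second that. Using Unidecode might help: https://pypi.python.org/pypi/Unidecode/0.04.1 –
OK, thanks, but in that case I get d('àa', 'aa') = 0. – vigte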