二進制搜索（對開），用於在Python 2.7

Unicode字符我有一個排序2048 Japanese Unicode words在理想情況下，我想指數使用二進制搜索，如this SO question ("Binary search (bisection) in Python)討論名單，J。 binary_search功能使用bisect模塊中的bisect_left，而不是使用list.index函數對列表編制索引。二進制搜索（對開），用於在Python 2.7

import bisect 
def binary_search(a, x, lo=0, hi=None): # can't use a to specify default for hi 
    hi = hi if hi is not None else len(a) # hi defaults to len(a) 
    pos = bisect.bisect_left(a, x, lo, hi) # find insertion point 
    return (pos if pos != hi and a[pos] == x else -1)

現在，讓我們指數words每個字：

words = "そつう　れきだい　ほんやく　わかす　りくつ　ばいか　ろせん　やちん　そつう　れきだい　ほんやく　わかめ"

我們分給列表：

>>> words.split() 
[u'\u305d\u3064\u3046', u'\u308c\u304d\u305f\u3099\u3044', u'\u307b\u3093\u3084\u304f', u'\u308f\u304b\u3059', u'\u308a\u304f\u3064', u'\u306f\u3099\u3044\u304b', u'\u308d\u305b\u3093', u'\u3084\u3061\u3093', u'\u305d\u3064\u3046', u'\u308c\u304d\u305f\u3099\u3044', u'\u307b\u3093\u3084\u304f', u'\u308f\u304b\u3081']`)

比較binary_search與list.index：

>>> [ binary_search(J, x) for x in words.split()] 
[-1, 2015, 1790, 2039, 1983, -1, 2031, 1919, -1, 2015, 1790, 2040] 
>>> [ J.index(x) for x in words.split()] 
[1019, 2015, 1790, 2039, 1983, 1533, 2031, 1919, 1019, 2015, 1790, 2040]

所以對於u'\u305d\u3064\u3046'（そつう），而不是1019的索引，binary_search正在返回的索引-1（失敗），因爲pos評估爲，u'\u305d\u3068\u304b\u3099\u308f'。 這兩個詞（索引1019 & 1027）都以u'\u305d'開頭，所以這就是問題出現的地方。

如何修改binary_search以查找（日文）Unicode字符的索引？

來源

2015-07-01 Aussie Cryptocurrency

二進制搜索需要對輸入列表進行排序。您的輸入是否分類？ –

什麼是J？這不是你的'words.split（）'列表。你不能在這裏使用'words'值，因爲這不是一個列表，而是一個字符串。 –

@MartijnPieters列表中，J，*是*排序。該代碼採用Unicode助記符字符串;由於日文中使用的表意空間，我簡化了它。 Binary_search是從列表中索引單詞，因此拆分 –

看起來您的J列表在這裏是錯誤的;它沒有被正確分類。

當我使用requests庫加載測試數據文件和排序的信息，它工作正常：

>>> # your bisect function defined 
... 
>>> import requests 
>>> words = u"そつう　れきだい　ほんやく　わかす　りくつ　ばいか　ろせん　やちん　そつう　れきだい　ほんやく　わかめ" 
>>> url = 'https://raw.githubusercontent.com/trezor/python-mnemonic/master/mnemonic/wordlist/japanese.txt' 
>>> J = sorted(requests.get(url).text.splitlines()) 
>>> [ J.index(x) for x in words.split()] 
[1024, 2015, 1787, 2039, 1983, 1601, 2031, 1919, 1024, 2015, 1787, 2040] 
>>> [ binary_search(J, x) for x in words.split()] 
[1024, 2015, 1787, 2039, 1983, 1601, 2031, 1919, 1024, 2015, 1787, 2040]

注意，指數不太符合你的測試結果;如果指數超過3，則平分線不能找到結果。

來源

2015-07-01 08:52:09

二進制搜索（對開），用於在Python 2.7

回答

相關問題