2017-04-06 43 views
1

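Using numpy to convert a huge number of 2-byte strings to corresponding 1-byte strings according to a fixed mapping

I have a set of 12 distinct 2-byte strings that map to a set of 12 corresponding 1-byte strings, according to the following translation dictionary: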

translation_dict = {'AC': '2', 'AG': '3', 'AT': '4', 
        'CA': '5', 'CG': '6', 'CT': '7', 
        'GA': '8', 'GC': '9', 'GT': 'a', 
        'TA': 'b', 'TC': 'c', 'TG': 'd'} 

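I need some method for translating a huge numpy.char.array of 2-byte strings into their corresponding 1-byte string mappings, as shown in the following example: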

>>> input_array = numpy.char.array(['CA', 'CA', 'GC', 'TC', 'AT', 'GT', 'AG', 'CT']) 
>>> output_array = some_method(input_array) 
>>> output_array 
chararray(['5', '5', '9', 'c', '4', 'a', '3', '7'], dtype='S1') 

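I am wondering whether there is a fast numpy.char.array method for translating huge arrays of 2-byte strings. I know I can use 'numpy.vectorize' with a function that explicitly looks up the 1-byte dictionary value for each 2-byte key, but that is relatively slow. I could not figure out how to use numpy.chararray.translate, as it seems to work only for 1-byte:1-byte mappings.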
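
For reference, the slow baseline mentioned above might look roughly like the following. This is only a sketch: the name `slow_translate` is made up for illustration, and a plain `numpy.array` is assumed instead of a `numpy.char.array`.

```python
import numpy as np

translation_dict = {'AC': '2', 'AG': '3', 'AT': '4',
                    'CA': '5', 'CG': '6', 'CT': '7',
                    'GA': '8', 'GC': '9', 'GT': 'a',
                    'TA': 'b', 'TC': 'c', 'TG': 'd'}

def lookup_one(key):
    # One Python-level dictionary lookup per element
    return translation_dict[key]

# np.vectorize is a convenience wrapper around a Python loop,
# so this calls lookup_one once for every array element.
slow_translate = np.vectorize(lookup_one, otypes=[str])

input_array = np.array(['CA', 'CA', 'GC', 'TC', 'AT', 'GT', 'AG', 'CT'])
output_array = slow_translate(input_array)
print(output_array)
```

Because every element goes through the Python interpreter, the cost grows linearly with a large constant factor, which is why it becomes impractical for billions of entries.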

+0

Define "huge". Millions? Billions? Trillions? –

+0

Why is the input a 'numpy.char.array'? Is that something you can change? –

+0

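@WarrenWeckesser "Huge" in this case is on the order of billions, and 'numpy.char.array' is not required. I used it in the example because I was still stuck on finding a clever use of 'chararray.translate'. – isosceleswheel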

Answers

2

For search operations like this, NumPy has np.searchsorted, so let me suggest an approach based on it -

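import numpy as np

def search_dic(dic, search_keys): 
    # Extract keys and values as arrays; in Python 3, dict views
    # must be turned into lists before NumPy can sort them
    k = np.array(list(dic.keys())) 
    v = np.array(list(dic.values())) 

    # Use searchsorted to locate each search key in the sorted order of k 
    sidx = np.argsort(k) 
    idx = np.searchsorted(k, np.asarray(search_keys), sorter=sidx) 

    # Finally index and extract the corresponding values 
    return np.take(v, sidx[idx]) 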

Sample run -

In [46]: translation_dict = {'AC': '2', 'AG': '3', 'AT': '4', 
    ...:      'CA': '5', 'CG': '6', 'CT': '7', 
    ...:      'GA': '8', 'GC': '9', 'GT': 'a', 
    ...:      'TA': 'b', 'TC': 'c', 'TG': 'd'} 

In [47]: s = np.char.array(['CA', 'CA', 'GC', 'TC', 'AT', 'GT', 'AG', 'CT']) 

In [48]: search_dic(translation_dict, s) 
Out[48]: 
array(['5', '5', '9', 'c', '4', 'a', '3', '7'], 
     dtype='|S1') 
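
To make the role of `sorter` clearer, here is a small step-by-step sketch on a deliberately unsorted subset of the keys. The arrays `keys`, `values` and `queries` are made up for illustration and are not part of the original data.

```python
import numpy as np

keys = np.array(['TA', 'AC', 'GC', 'CA'])   # deliberately not sorted
values = np.array(['b', '2', '9', '5'])     # values[i] belongs to keys[i]
queries = np.array(['CA', 'GC', 'AC'])

# Indices that would sort keys: AC, CA, GC, TA -> [1, 3, 2, 0]
sidx = np.argsort(keys)

# Position of each query within the *sorted* keys: CA -> 1, GC -> 2, AC -> 0
pos = np.searchsorted(keys, queries, sorter=sidx)

# Map sorted positions back to original positions, then pick the values
result = values[sidx[pos]]
print(sidx, pos, result)
```

Note that np.searchsorted returns insertion points and does not check membership, so this lookup is only safe when every query is known to be one of the keys.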
+0

This works great - thank you for showing me a new application of 'numpy.searchsorted' :-) – isosceleswheel

0

How about finding the unique elements and reindexing:

uniq, inv_idx = np.unique(input_array, return_inverse=True) 

np.array([translation_dict[u] for u in uniq])[inv_idx] 

#array(['5', '5', '9', 'c', '4', 'a', '3', '7'], 
# dtype='<U1') 
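
As a sketch of what the two arrays returned by np.unique contain for the example input (a plain `numpy.array` is assumed here instead of a `numpy.char.array`):

```python
import numpy as np

input_array = np.array(['CA', 'CA', 'GC', 'TC', 'AT', 'GT', 'AG', 'CT'])

# uniq holds the sorted distinct strings; inv_idx holds, for every
# input element, the index of that element within uniq
uniq, inv_idx = np.unique(input_array, return_inverse=True)
print(uniq)     # the 7 distinct pairs present in the input, sorted
print(inv_idx)  # one index into uniq per input element

# Indexing uniq with inv_idx reconstructs the original array, so
# translating only uniq and reindexing translates the whole input
reconstructed = uniq[inv_idx]
```

The dictionary is therefore consulted once per distinct key (at most 12 times here), while the per-element work is the sort inside np.unique plus one fancy-indexing step.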

Benchmark:

import time 

x = np.random.choice(list(translation_dict.keys()),1000000) 

t = time.time() 
uniq, inv_idx = np.unique(x, return_inverse=True) 
res = np.array([translation_dict[u] for u in uniq])[inv_idx] 
print("Colonel Beauvel timing is:" + str(time.time()-t)) 

t = time.time() 
res = search_dic(translation_dict, x) 
print("Divakar timing is:" + str(time.time()-t)) 

#Colonel Beauvel timing is:0.32760000228881836 
#Divakar timing is:0.10920000076293945 

Divakar wins hands down, about three times faster!

0

Here is a slightly hackish but fast method using a cheap "hash":

import numpy as np 
from timeit import timeit 

translation_dict = {'AC': '2', 'AG': '3', 'AT': '4', 
        'CA': '5', 'CG': '6', 'CT': '7', 
        'GA': '8', 'GC': '9', 'GT': 'a', 
        'TA': 'b', 'TC': 'c', 'TG': 'd'} 

keys, values = map(np.char.array, zip(*translation_dict.items())) 

N = 1000000 
mock_data = keys[np.random.randint(0,12,(N,))] 

def lookup(hash_fun, td, data): 
    keys, values = map(np.char.array, zip(*td.items())) 
    keys_ = hash_fun(keys) 
    assert len(set(keys_)) == len(keys) 
    data = hash_fun(data) 
    lookup = np.empty(max(keys_) + 1, values.dtype) 
    lookup[keys_] = values 
    return lookup[data].view(np.char.chararray) 

def hash_12(table): 
    unit = {8:np.uint32, 4:np.uint16, 2:np.uint8}[table.dtype.itemsize] 
    lookup = table.view(np.ndarray).view(unit) 
    return (lookup[1::2]<<1) + lookup[::2] 

def search_dic(dic, search_keys): 
    # Extract keys and values as arrays; in Python 3, dict views
    # must be turned into lists before NumPy can sort them
    k = np.array(list(dic.keys())) 
    v = np.array(list(dic.values())) 

    # Use searchsorted to locate the indices 
    sidx = np.argsort(k) 
    idx = np.searchsorted(k, search_keys.view(np.ndarray), sorter=sidx) 

    # Finally index and extract out the corresponding values 
    return np.take(v, sidx[idx]) 

def uniq(translation_dict, input_array): 
    uniq, inv_idx = np.unique(input_array, return_inverse=True) 
    return np.char.array([translation_dict[u] for u in uniq])[inv_idx] 


# correctness 
print(np.all(lookup(hash_12, translation_dict, mock_data) 
      == search_dic(translation_dict, mock_data))) 
print(np.all(lookup(hash_12, translation_dict, mock_data) 
      == uniq(translation_dict, mock_data))) 

# performance 
print('C_Beauvel {:9.6f} secs'.format(timeit(lambda: uniq(
    translation_dict, mock_data), number=10)/10)) 
print('Divakar {:9.6f} secs'.format(timeit(lambda: search_dic(
    translation_dict, mock_data), number=10)/10)) 
print('PP  {:9.6f} secs'.format(timeit(lambda: lookup(
    hash_12, translation_dict, mock_data), number=10)/10)) 

Prints:

True 
True 
C_Beauvel 0.622123 secs 
Divakar 0.050903 secs 
PP   0.011464 secs
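
To show why the cheap "hash" works, here is a minimal sketch of the same idea written out for byte strings. It assumes the data is stored with dtype 'S2' (one byte per character), which is a simplification of the hash_12 function above. The hash is first_byte + 2 * second_byte, which happens to be collision-free for these 12 keys.

```python
import numpy as np

keys = np.array(['AC', 'AG', 'AT', 'CA', 'CG', 'CT',
                 'GA', 'GC', 'GT', 'TA', 'TC', 'TG'], dtype='S2')
values = np.array(['2', '3', '4', '5', '6', '7',
                   '8', '9', 'a', 'b', 'c', 'd'], dtype='S1')

def cheap_hash(arr):
    # Reinterpret each 2-byte string as two uint8 byte values
    codes = arr.view(np.uint8).reshape(-1, 2).astype(np.intp)
    return codes[:, 0] + 2 * codes[:, 1]

h = cheap_hash(keys)
assert len(np.unique(h)) == len(keys)   # no two keys collide

# Dense lookup table indexed by hash value; unused slots stay empty
table = np.zeros(h.max() + 1, dtype='S1')
table[h] = values

data = np.array(['CA', 'CA', 'GC', 'TC', 'AT', 'GT', 'AG', 'CT'], dtype='S2')
out = table[cheap_hash(data)]
print(out)
```

The translation then reduces to integer arithmetic plus a single fancy-indexing step into a small table, with no sorting or searching involved.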