2017-05-07

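Assign the same lexicographic rank to duplicate elements of a 2D array

I'm trying to rank the rows of an array lexicographically. The code below works fine, but I would like equal rows to be assigned equal ranks.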

import numpy as np 

values = np.asarray([ 
    [1, 2, 3], 
    [1, 1, 1], 
    [2, 2, 3], 
    [1, 2, 3], 
    [1, 1, 2] 
]) 
# need to flip, because for `np.lexsort` last 
# element has highest priority. 
values_reversed = np.fliplr(values) 
# this returns the order, i.e. the order in 
# which the elements should be in a sorted 
# array (not the rank by index). 
order = np.lexsort(values_reversed.T) 
# convert order to ranks. 
n = values.shape[0] 
ranks = np.empty(n, dtype=int) 
# use order to assign ranks. 
ranks[order] = np.arange(n) 

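The `ranks` variable contains [2, 0, 4, 3, 1], but a rank array of [2, 0, 4, 2, 1] is required, because the rows [1, 2, 3] (indices 0 and 3) should share the same rank. Consecutive rank numbers are fine too, so [2, 0, 3, 2, 1] is also an acceptable rank array.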
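To make the difference between the two acceptable outputs concrete, here is a minimal sketch (not part of the original question) that converts the gapped ranks into consecutive ones. It assumes the gapped ranks are already available as a 1-D integer array; the variable names are only illustrative.

```python
import numpy as np

# Gapped ranks: tied rows share the lowest rank and the next rank is skipped.
min_ranks = np.array([2, 0, 4, 2, 1])

# np.unique returns the sorted distinct ranks; return_inverse maps each
# original entry to its position among them, which yields consecutive ranks.
_, dense_ranks = np.unique(min_ranks, return_inverse=True)

print(dense_ranks)  # [2 0 3 2 1]
```

Either form satisfies the requirement that indices 0 and 3 end up with the same rank.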

Answers


Here's one approach -

# Get lexsorted indices and hence sorted values by those indices 
lexsort_idx = np.lexsort(values.T[::-1]) 
lexsort_vals = values[lexsort_idx] 

# Mask of steps where rows shift (there are no duplicates in subsequent rows) 
mask = np.r_[True,(lexsort_vals[1:] != lexsort_vals[:-1]).any(1)] 

# Get the stepped indices (indices shift at non duplicate rows) and 
# the index values are scaled corresponding to row numbers  
stepped_idx = np.maximum.accumulate(mask*np.arange(mask.size))  

# Re-arrange the stepped indices based on the original order of rows 
# This is basically same as the original code does in last 4 steps, 
# just in a concise manner 
out_idx = stepped_idx[lexsort_idx.argsort()] 

Sample step-by-step intermediate outputs -

In [55]: values 
Out[55]: 
array([[1, 2, 3], 
     [1, 1, 1], 
     [2, 2, 3], 
     [1, 2, 3], 
     [1, 1, 2]]) 

In [56]: lexsort_idx 
Out[56]: array([1, 4, 0, 3, 2]) 

In [57]: lexsort_vals 
Out[57]: 
array([[1, 1, 1], 
     [1, 1, 2], 
     [1, 2, 3], 
     [1, 2, 3], 
     [2, 2, 3]]) 

In [58]: mask 
Out[58]: array([ True, True, True, False, True], dtype=bool) 

In [59]: stepped_idx 
Out[59]: array([0, 1, 2, 2, 4]) 

In [60]: lexsort_idx.argsort() 
Out[60]: array([2, 0, 4, 3, 1]) 

In [61]: stepped_idx[lexsort_idx.argsort()] 
Out[61]: array([2, 0, 4, 2, 1]) 
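For anyone who wants to see why the mask and the running maximum produce tied ranks, the steps can be collected into one commented function. This is only a sketch of the same idea: the name `lex_rank_min` is made up for illustration, and the final scatter is written out explicitly instead of calling `argsort()`.

```python
import numpy as np

def lex_rank_min(values):
    """Hypothetical helper: lexicographic row ranks, ties share the lowest rank."""
    # np.lexsort treats the LAST key as primary, so pass the columns reversed.
    order = np.lexsort(values.T[::-1])
    sorted_rows = values[order]

    # True where a sorted row differs from the previous one (start of a new group).
    is_new = np.r_[True, (sorted_rows[1:] != sorted_rows[:-1]).any(axis=1)]

    # Keep the position at group starts and zero elsewhere; the running maximum
    # then carries each group's start position forward over its duplicates.
    group_start = np.maximum.accumulate(is_new * np.arange(is_new.size))

    # Scatter the per-sorted-position ranks back to the original row order.
    ranks = np.empty(len(values), dtype=int)
    ranks[order] = group_start
    return ranks

values = np.array([[1, 2, 3], [1, 1, 1], [2, 2, 3], [1, 2, 3], [1, 1, 2]])
print(lex_rank_min(values))  # [2 0 4 2 1]
```

The sketch assumes a non-empty 2-D array; an input with zero rows would need a separate guard.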

Performance boost

To compute lexsort_idx.argsort() more efficiently, we could use the following, which is the same as the last 4 lines of the original code -

def argsort_unique(idx): 
    # Original idea : http://stackoverflow.com/a/41242285/3293881 by @Andras 
    n = idx.size 
    sidx = np.empty(n,dtype=int) 
    sidx[idx] = np.arange(n) 
    return sidx 

Thus, lexsort_idx.argsort() could alternatively be computed with argsort_unique(lexsort_idx).
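A quick sanity check of that equivalence, as a sketch: the scatter trick only matches `argsort()` when the input is a permutation of `0..n-1`, which is always the case for indices returned by `np.lexsort`. The seed and the array size below are arbitrary choices for the illustration.

```python
import numpy as np

def argsort_unique(idx):
    # Same scatter trick as above: valid only when idx is a permutation.
    n = idx.size
    sidx = np.empty(n, dtype=int)
    sidx[idx] = np.arange(n)
    return sidx

rng = np.random.default_rng(0)  # seed chosen arbitrarily for the illustration
perm = rng.permutation(1000)

same = np.array_equal(argsort_unique(perm), perm.argsort())
print(same)  # True
```

The scatter runs in linear time, whereas `argsort()` has to sort, which is where the saving comes from.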


Runtime test

Applying a few more optimization tricks, we would have a version like so -

def numpy_app(values): 
    lexsort_idx = np.lexsort(values.T[::-1]) 
    lexsort_v = values[lexsort_idx] 
    mask = np.concatenate(([False],(lexsort_v[1:] == lexsort_v[:-1]).all(1))) 

    stepped_idx = np.arange(mask.size) 
    stepped_idx[mask] = 0 
    np.maximum.accumulate(stepped_idx, out=stepped_idx) 

    return stepped_idx[argsort_unique(lexsort_idx)] 

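@Warren Weckesser's rankdata-based approach as a func for timings -

from scipy.stats import rankdata

def scipy_app(values): 
    # View each row as one structured element so rankdata compares whole rows
    v = values.view(np.dtype(','.join([values.dtype.str]*values.shape[1]))) 
    return rankdata(v, method='min') - 1 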

Timings -

In [97]: a = np.random.randint(0,9,(10000,3)) 

In [98]: out1 = numpy_app(a) 

In [99]: out2 = scipy_app(a) 

In [100]: np.allclose(out1, out2) 
Out[100]: True 

In [101]: %timeit scipy_app(a) 
100 loops, best of 3: 5.32 ms per loop 

In [102]: %timeit numpy_app(a) 
100 loops, best of 3: 1.96 ms per loop 

Would you mind explaining what the steps do? It seems to work fine, but I don't quite understand how... – orange


Here's a way to do it using scipy.stats.rankdata (with method='min'), by viewing the 2-d array as a 1-d structured array:

In [15]: values 
Out[15]: 
array([[1, 2, 3], 
     [1, 1, 1], 
     [2, 2, 3], 
     [1, 2, 3], 
     [1, 1, 2]]) 

In [16]: v = values.view(np.dtype(','.join([values.dtype.str]*values.shape[1]))) 

In [17]: rankdata(v, method='min') - 1 
Out[17]: array([2, 0, 4, 2, 1])
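If consecutive ranks are acceptable (the question says they are), a NumPy-only variant is possible with `np.unique` along `axis=0`. This is a different technique from the `rankdata` approach above, not a restatement of it, and it is only a sketch for integer rows like the ones in the example.

```python
import numpy as np

values = np.array([[1, 2, 3], [1, 1, 1], [2, 2, 3], [1, 2, 3], [1, 1, 2]])

# axis=0 treats each row as one item; the unique rows come back in
# lexicographic order, and the inverse gives each row's index among them.
_, inverse = np.unique(values, axis=0, return_inverse=True)

# Flatten defensively so the result is 1-D regardless of NumPy version.
dense_ranks = np.asarray(inverse).reshape(-1)
print(dense_ranks)  # [2 0 3 2 1]
```

Note that this gives [2, 0, 3, 2, 1] rather than [2, 0, 4, 2, 1], since the rank after a tie is not skipped.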