2017-05-24 64 views
3

有沒有一種方法可以計算沿着nD陣列軸的許多直方圖?我公司目前擁有該方法使用for循環遍歷所有其它的軸,並計算numpy.histogram()每個產生的一維數組:沿着軸計算直方圖

import numpy 
import itertools 
data = numpy.random.rand(4, 5, 6) 

# axis=-1, place `200001` and `[slice(None)]` on any other position to process along other axes 
out = numpy.zeros((4, 5, 200001), dtype="int64") 
indices = [ 
    numpy.arange(4), numpy.arange(5), [slice(None)] 
] 

# Iterate over all axes, calculate histogram for each cell 
for idx in itertools.product(*indices): 
    out[idx] = numpy.histogram(
     data[idx], 
     bins=2 * 100000 + 1, 
     range=(-100000 - 0.5, 100000 + 0.5), 
    )[0] 

out.shape # (4, 5, 200001) 

不用說,這是很慢的,但是我無法找到一個方法來解決這使用numpy.histogram,numpy.histogram2dnumpy.histogramdd

+1

見https://stackoverflow.com/questions/40018125/binning-of-data-along-one-axis-in - numpy但我不知道這是否比循環更快。 – kazemakase

+0

它們的速度大致相同(但沿軸應用更好閱讀)。 –

+0

檢查您是否可以使用[np.histogramdd](https://docs.scipy.org/doc/numpy-1.12.0/reference/generated/numpy.histogramdd.html) – MaxU

回答

3

下面是使用高效工具np.searchsortednp.bincount的矢量化方法。 searchsorted爲我們提供了每個元素將根據垃圾箱放置的情況,bincount爲我們計數。

實現 -

def hist_laxis(data, n_bins, range_limits): 
    # Setup bins and determine the bin location for each element for the bins 
    R = range_limits 
    N = data.shape[-1] 
    bins = np.linspace(R[0],R[1],n_bins+1) 
    data2D = data.reshape(-1,N) 
    idx = np.searchsorted(bins, data2D,'right')-1 

    # Some elements would be off limits, so get a mask for those 
    bad_mask = (idx==-1) | (idx==n_bins) 

    # We need to use bincount to get bin based counts. To have unique IDs for 
    # each row and not get confused by the ones from other rows, we need to 
    # offset each row by a scale (using row length for this). 
    scaled_idx = n_bins*np.arange(data2D.shape[0])[:,None] + idx 

    # Set the bad ones to be last possible index+1 : n_bins*data2D.shape[0] 
    limit = n_bins*data2D.shape[0] 
    scaled_idx[bad_mask] = limit 

    # Get the counts and reshape to multi-dim 
    counts = np.bincount(scaled_idx.ravel(),minlength=limit+1)[:-1] 
    counts.shape = data.shape[:-1] + (n_bins,) 
    return counts 

運行測試

原始的方法 -

def org_app(data, n_bins, range_limits): 
    R = range_limits 
    m,n = data.shape[:2] 
    out = np.zeros((m, n, n_bins), dtype="int64") 
    indices = [ 
     np.arange(m), np.arange(n), [slice(None)] 
    ] 

    # Iterate over all axes, calculate histogram for each cell 
    for idx in itertools.product(*indices): 
     out[idx] = np.histogram(
      data[idx], 
      bins=n_bins, 
      range=(R[0], R[1]), 
     )[0] 
    return out 

時序和驗證 -

In [2]: data = np.random.randn(4, 5, 6) 
    ...: out1 = org_app(data, n_bins=200001, range_limits=(- 2.5, 2.5)) 
    ...: out2 = hist_laxis(data, n_bins=200001, range_limits=(- 2.5, 2.5)) 
    ...: print np.allclose(out1, out2) 
    ...: 
True 

In [3]: %timeit org_app(data, n_bins=200001, range_limits=(- 2.5, 2.5)) 
10 loops, best of 3: 39.3 ms per loop 

In [4]: %timeit hist_laxis(data, n_bins=200001, range_limits=(- 2.5, 2.5)) 
100 loops, best of 3: 3.17 ms per loop 

因爲在loopy解決方案中,我們正在循環前兩個軸。所以,讓我們增加它們的長度,因爲這將告訴我們如何好是一個矢量 -

In [59]: data = np.random.randn(400, 500, 6) 

In [60]: %timeit org_app(data, n_bins=21, range_limits=(- 2.5, 2.5)) 
1 loops, best of 3: 9.59 s per loop 

In [61]: %timeit hist_laxis(data, n_bins=21, range_limits=(- 2.5, 2.5)) 
10 loops, best of 3: 44.2 ms per loop 

In [62]: 9590/44.2   # Speedup number 
Out[62]: 216.9683257918552 
+0

你能舉個例子嗎?我無法使用'counts.shape = data.shape [: - 1] +(n_bins,):ValueError:無法將大小爲3900020的數組重新整形(4,5,200001)' –

+0

@NilsWerner需要帶有'bincount'的'minlength'標準。請檢查編輯。 – Divakar

+0

非常聰明的做法,謝謝! –