Group by max or min in a numpy array

I have two equal-length 1D numpy arrays, id and data, where id is a sequence of repeating, ordered integers that define sub-windows on data.

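I would like to aggregate data by grouping on id and taking either the max or the min. In SQL, this would be a typical aggregation query such as SELECT MAX(data) FROM tablename GROUP BY id ORDER BY id. Is there a way to avoid Python loops and do this in a vectorized manner, or do I have to drop down to C?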
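To make the question concrete, here is a minimal sketch of the loop-based baseline that the question wants to avoid. The sample arrays are made up for illustration (they are not from the original post) and were chosen so that the per-group maxima match the `[7, 10, 1]` quoted in the answers. The name `ids` stands in for `id` only to avoid shadowing the Python builtin.

```python
import numpy as np

# Hypothetical sample data (an assumption, not taken from the original post)
ids = np.array([1, 1, 1, 2, 2, 2, 3, 3])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])

# Baseline: one boolean mask and one reduction per distinct id
group_ids = np.unique(ids)
group_max = [int(data[ids == g].max()) for g in group_ids]
group_min = [int(data[ids == g].min()) for g in group_ids]

print(group_max)  # [7, 10, 1]
print(group_min)  # [2, 8, -10]
```

This does O(n) work for every distinct id, which is why the answers look for a single sorted or indexed pass instead.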

2011-12-24 Abiel

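I've been seeing some very similar questions on Stack Overflow over the last few days. The following code is very similar to the implementation of numpy.unique, and because it takes advantage of the underlying numpy machinery, it is most likely faster than anything you can do in a Python loop.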

import numpy as np 
def group_min(groups, data): 
    # sort with major key groups, minor key data 
    order = np.lexsort((data, groups)) 
    groups = groups[order] # this is only needed if groups is unsorted 
    data = data[order] 
    # construct an index which marks borders between groups 
    index = np.empty(len(groups), 'bool') 
    index[0] = True 
    index[1:] = groups[1:] != groups[:-1] 
    return data[index] 

#max is very similar 
def group_max(groups, data): 
    order = np.lexsort((data, groups)) 
    groups = groups[order] #this is only needed if groups is unsorted 
    data = data[order] 
    index = np.empty(len(groups), 'bool') 
    index[-1] = True 
    index[:-1] = groups[1:] != groups[:-1] 
    return data[index]

2011-12-24 06:42:10

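Thanks @Bago, this gives great performance. Another thing I find useful is that lexsort appears to always place NaN values at the end of each sub-window. So if I want the max of each window excluding NaNs, I can flip the sign of the data, apply the min formula, and flip the sign again on the way out, with only a small performance penalty. On the other hand, if I actually want NaN returned whenever a NaN is present anywhere in the sub-window, I can leave it as is. – Abiel

Abiel, see np.nanmax, a max that ignores NaNs – denis

Nice solution. It is annoying that it takes O(n log n) time and O(n) memory when O(n) time and O(k) memory would suffice for k bins. Perhaps numpy should support 'binmax' and 'bincount'. – joeln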
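The sign-flip trick from the comments can be sketched as follows. This assumes, as the comment observes, that lexsort orders NaN after every finite value within a group. The sample arrays are made up for illustration, and the two functions repeat the answer's code so the block runs on its own.

```python
import numpy as np

def group_min(groups, data):
    # Sort by group, then by value; the first row of each group is its minimum
    order = np.lexsort((data, groups))
    groups, data = groups[order], data[order]
    index = np.empty(len(groups), dtype=bool)
    index[0] = True
    index[1:] = groups[1:] != groups[:-1]
    return data[index]

def group_max(groups, data):
    # Same sort; the last row of each group is its maximum (or NaN if present)
    order = np.lexsort((data, groups))
    groups, data = groups[order], data[order]
    index = np.empty(len(groups), dtype=bool)
    index[-1] = True
    index[:-1] = groups[1:] != groups[:-1]
    return data[index]

# Hypothetical sample data with one NaN in each group
groups = np.array([1, 1, 1, 2, 2])
data = np.array([2.0, np.nan, 7.0, np.nan, 8.0])

# NaN sorts last in each group, so the plain max reports NaN
nan_propagating_max = group_max(groups, data)

# Negate, take the min (which skips the trailing NaN), negate back
nan_ignoring_max = -group_min(groups, -data)

print(nan_propagating_max)
print(nan_ignoring_max)
```

A group that contains only NaN values would still return NaN from either variant.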

I think this accomplishes what you're looking for:

[max([val for idx,val in enumerate(data) if id[idx] == k]) for k in sorted(set(id))]

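Reading the outer list comprehension from right to left: set(id) collects the distinct ids, sorted() sorts them, for k ... iterates over them, and max takes the maximum of, in this case, another list comprehension. Moving to that inner list comprehension: enumerate(data) returns both the index and the value from data, and if id[idx] == k picks out the data members that belong to id k.

This iterates over the full data list once for each id. With some preprocessing into sublists it might be possible to speed it up, but then it would no longer be a one-liner.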
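One possible form of the preprocessing mentioned above is to bucket the values by id in a single pass and then reduce each bucket. This is only a sketch: the sample lists are invented for illustration, and `ids` is used instead of `id` to avoid shadowing the builtin.

```python
from collections import defaultdict

# Hypothetical sample data (not from the original post)
ids = [1, 1, 1, 2, 2, 2, 3, 3]
data = [2, 7, 3, 8, 9, 10, 1, -10]

# Single pass: append each value to the list for its id
buckets = defaultdict(list)
for key, val in zip(ids, data):
    buckets[key].append(val)

# One reduction per id, in sorted id order
group_max = [max(buckets[key]) for key in sorted(buckets)]
group_min = [min(buckets[key]) for key in sorted(buckets)]

print(group_max)  # [7, 10, 1]
print(group_min)  # [2, 8, -10]
```

Each element is visited once while bucketing and once while reducing, instead of once per distinct id.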

2011-12-24 06:34:43 mtrw

In pure Python:

from itertools import groupby 
from operator import itemgetter as ig 

# Python 3: map and zip are already lazy, so imap/izip are not needed 
print([max(map(ig(1), g)) for k, g in groupby(zip(id, data), key=ig(0))]) 
# -> [7, 10, 1]

A variation:

print([data[id==i].max() for i, _ in groupby(id)]) 
# -> [7, 10, 1]

Based on @Bago's answer:

import numpy as np 

# sort by `id` then by `data` 
ndx = np.lexsort(keys=(data, id)) 
id, data = id[ndx], data[ndx] 

# get max() 
print(data[np.r_[np.diff(id), True].astype(bool)]) 
# -> [ 7 10  1]

If pandas is installed:

from pandas import DataFrame 

df = DataFrame(dict(id=id, data=data)) 
print(df.groupby('id')['data'].max()) 
# id 
# 1     7 
# 2    10 
# 3     1

2011-12-24 06:38:33 jfs

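Thanks @JF for all the different approaches. The numpy solution is of course faster than pure Python, but I was surprised by how fast your first pure-Python solution is. I was curious about the relative performance of the pandas solution; unfortunately I couldn't test it, because I got a NameError when I tried to import DataFrame with the latest build. – Abiel

@Abiel: pandas.__version__ == '0.6.1' – jfs

+1 for pandas. I think it is the simplest in terms of readability. –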

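The following solution only requires a sort on the data (not a lexsort) and does not require finding the boundaries between groups. It relies on the fact that if o is an array of indices into r, then r[o] = x fills r with the latest value of x for each value of o, so that r[[0, 0]] = [1, 2] leaves r[0] = 2. It requires that your groups are integers from 0 to the number of groups - 1, as for numpy.bincount, and that there is a value for every group: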

def group_min(groups, data): 
    n_groups = np.max(groups) + 1 
    result = np.empty(n_groups) 
    order = np.argsort(data)[::-1] 
    result[groups.take(order)] = data.take(order) 
    return result 

def group_max(groups, data): 
    n_groups = np.max(groups) + 1 
    result = np.empty(n_groups) 
    order = np.argsort(data) 
    result[groups.take(order)] = data.take(order) 
    return result
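A short sketch of the behaviour these functions rely on, cross-checked against an explicit loop. The sample arrays are made up for illustration. Treat "the last write wins for a repeated index" as an assumption about NumPy's assignment order (it is what the answer depends on) and verify it against a slower method before relying on it.

```python
import numpy as np

# Hypothetical sample data; labels must be 0..k-1 with no empty group
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2])
data = np.array([2.0, 7.0, 3.0, 8.0, 9.0, 10.0, 1.0, -10.0])

# Repeated index: the later write is the one that survives here
r = np.zeros(1)
r[[0, 0]] = [1, 2]
print(r[0])

# Ascending sort, so each group's largest value is written last
order = np.argsort(data)
fancy_max = np.empty(groups.max() + 1)
fancy_max[groups[order]] = data[order]

# Independent cross-check with an explicit loop over the group labels
loop_max = np.array([data[groups == g].max() for g in range(groups.max() + 1)])
print(fancy_max, loop_max)
```

If some label in 0..k-1 never occurs, its slot in the `np.empty` result is never written and holds an arbitrary value.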

2013-08-11 01:03:54 joeln

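A slightly faster and more general answer than the accepted one; like joeln's answer, it avoids the more expensive lexsort, and it works for arbitrary ufuncs. Moreover, it only requires that the keys be sortable, rather than integers in a specific range. The accepted answer may still be faster, though, since it never computes the max/min explicitly. The accepted solution's ability to ignore NaNs is neat; but one can also simply assign NaN values a dummy key.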

import numpy as np 

def group(key, value, operator=np.add): 
    """ 
    group the values by key 
    any ufunc operator can be supplied to perform the reduction (np.maximum, np.minimum, np.subtract, and so on) 
    returns the unique keys, their corresponding per-key reduction over the operator, and the keycounts 
    """ 
    #upcast to numpy arrays 
    key = np.asarray(key) 
    value = np.asarray(value) 
    #first, sort by key 
    I = np.argsort(key) 
    key = key[I] 
    value = value[I] 
    #the slicing points of the bins to sum over 
    slices = np.concatenate(([0], np.where(key[:-1]!=key[1:])[0]+1)) 
    #first entry of each bin is a unique key 
    unique_keys = key[slices] 
    #reduce over the slices specified by index 
    per_key_sum = operator.reduceat(value, slices) 
    #number of counts per key is the difference of our slice points. cap off with number of keys for last bin 
    key_count = np.diff(np.append(slices, len(key))) 
    return unique_keys, per_key_sum, key_count 


names = ["a", "b", "b", "c", "d", "e", "e"] 
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01] 

unique_keys, reduced_values, key_count = group(names, values) 
print('per group mean') 
print(reduced_values/key_count) 
unique_keys, reduced_values, key_count = group(names, values, np.minimum) 
print('per group min') 
print(reduced_values) 
unique_keys, reduced_values, key_count = group(names, values, np.maximum) 
print('per group max') 
print(reduced_values)
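The remark about giving NaN values a dummy key could look like the sketch below. It inlines the sort-and-reduceat steps so that it runs on its own. The sample arrays and the choice of -1 as the dummy key are assumptions made for illustration; any key that cannot occur in the real data would do.

```python
import numpy as np

# Hypothetical sample data with NaNs mixed into two of the groups
keys = np.array([3, 1, 1, 3, 2, 2])
values = np.array([4.0, np.nan, 2.0, 9.0, 5.0, np.nan])

# Assumption: -1 never occurs as a real key, so it can collect the NaNs
DUMMY = -1
masked_keys = np.where(np.isnan(values), DUMMY, keys)

# Sort by key so that each group becomes one contiguous run
order = np.argsort(masked_keys, kind="stable")
sorted_keys = masked_keys[order]
sorted_values = values[order]

# Start offset of each run of equal keys
starts = np.flatnonzero(np.r_[True, sorted_keys[1:] != sorted_keys[:-1]])
unique_keys = sorted_keys[starts]
per_key_max = np.maximum.reduceat(sorted_values, starts)

# Drop the dummy bin that holds the NaNs
keep = unique_keys != DUMMY
print(unique_keys[keep], per_key_max[keep])
```

Because the NaNs are moved into their own bin before the reduction, they never take part in the maximum of a real group.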

2013-12-06 14:59:09

I'm fairly new to Python and NumPy, but it seems you can use the .at method of ufuncs rather than reduceat:

import numpy as np 
data_id = np.array([0,0,0,1,1,1,1,2,2,2,3,3,3,4,5,5,5]) 
data_val = np.random.rand(len(data_id)) 
ans = np.full(data_id.max() + 1, -np.inf) # np.empty would leave garbage for maximum.at to compare against; zeros also works here because rand() is non-negative 
np.maximum.at(ans,data_id,data_val)

For example:

data_val = array([ 0.65753453, 0.84279716, 0.88189818, 0.18987882, 0.49800668, 
    0.29656994, 0.39542769, 0.43155428, 0.77982853, 0.44955868, 
    0.22080219, 0.4807312 , 0.9288989 , 0.10956681, 0.73215416, 
    0.33184318, 0.10936647]) 
ans = array([ 0.88189818, 0.49800668, 0.77982853, 0.9288989 , 0.10956681, 
    0.73215416])

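Of course, this only makes sense if your data_id values are suitable for use as indices (i.e. non-negative integers that are not huge; if they are large or sparse, you could presumably initialize ans using np.unique(data_id) or something similar).

I should point out that data_id does not actually need to be sorted.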
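For keys that are large, sparse, or unsorted, one way to follow the np.unique suggestion is to map each key to a compact index first. The sketch below uses made-up sample arrays, and the names `inverse`, `group_max` and `group_min` are illustrative only.

```python
import numpy as np

# Hypothetical keys: unsorted, sparse, and too large to use directly as indices
data_id = np.array([1000, 7, 1000, 42, 7, 42, 42])
data_val = np.array([0.5, 0.1, 0.9, 0.3, 0.8, 0.2, 0.4])

# Map each key to a compact index 0..k-1; unique_ids comes back sorted
unique_ids, inverse = np.unique(data_id, return_inverse=True)

# Start from -inf so any real value replaces the initial one
group_max = np.full(len(unique_ids), -np.inf)
np.maximum.at(group_max, inverse, data_val)

# Same idea for the minimum, starting from +inf
group_min = np.full(len(unique_ids), np.inf)
np.minimum.at(group_min, inverse, data_val)

print(unique_ids)  # [   7   42 1000]
print(group_max)   # [0.8 0.4 0.9]
print(group_min)   # [0.1 0.2 0.5]
```

The np.unique call sorts the keys internally, so the preprocessing is O(n log n), but the output arrays hold only one slot per distinct key.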

2014-02-24 21:23:09

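I have packaged a version of my previous answer in the numpy_indexed package; it is nice to have all of this wrapped up and tested behind a neat interface, and it has a lot more functionality as well: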

import numpy_indexed as npi 
group_id, group_max_data = npi.group_by(id).max(data)

And so on.

2016-05-13 20:04:57
