Averaging duplicate values from two paired lists in Python using NumPy

In the past I found myself dealing with averaging two paired lists, and I successfully used the answers provided there.

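However, for large lists (more than 20,000 items) the procedure is somewhat slow, and I was wondering whether using NumPy would make it faster.

I start from two lists, one of floats and one of strings:

names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]

I am trying to calculate the mean of the values that share the same name, so that after applying it I would get: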

result_names = ["a", "b", "c", "d", "e"] 
result_values = [1.2, 4.4, 2.0, 5.67, 8.54]

I gave two lists as the example result, but a list of (name, value) tuples would also be fine:

result = [("a", 1.2), ("b", 4.4), ("c", 2.0), ("d", 5.67), ("e", 8.54)]

What is the best way to do this with NumPy?

2011-10-17 Einar

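With numpy you can write something yourself, or you can use a groupby function (I used rec_groupby from matplotlib.mlab, but it is much slower; for more powerful groupby functionality, maybe look at pandas). I compared these with Michael Dunn's dictionary answer: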

import numpy as np 
import random 
from matplotlib.mlab import rec_groupby 

listA = [random.choice("abcdef") for i in range(20000)] 
listB = [20 * random.random() for i in range(20000)] 

names = np.array(listA) 
values = np.array(listB) 

def f_dict(listA, listB):
    # collect the values belonging to each name
    d = {}
    for a, b in zip(listA, listB):
        d.setdefault(a, []).append(b)

    # average each group, in the same order as the keys
    avg = []
    for key in d:
        avg.append(sum(d[key]) / len(d[key]))

    return list(d.keys()), avg

def f_numpy(names, values):
    result_names = np.unique(names)
    result_values = np.empty(result_names.shape)

    # boolean mask per unique name, then the mean of the selected values
    for i, name in enumerate(result_names):
        result_values[i] = np.mean(values[names == name])

    return result_names, result_values

The results of the three approaches:

In [2]: f_dict(listA, listB) 
Out[2]: 
(['a', 'c', 'b', 'e', 'd', 'f'], 
[9.9003182717213765, 
    10.077784850173568, 
    9.8623915728699636, 
    9.9790599744319319, 
    9.8811096512807097, 
    10.118695410115953]) 

In [3]: f_numpy(names, values) 
Out[3]: 
(array(['a', 'b', 'c', 'd', 'e', 'f'], 
     dtype='|S1'), 
array([ 9.90031827, 9.86239157, 10.07778485, 9.88110965, 
     9.97905997, 10.11869541])) 

In [7]: rec_groupby(struct_array, ('names',), (('values', np.mean, 'resvalues'),)) 
Out[7]: 
rec.array([('a', 9.900318271721376), ('b', 9.862391572869964), 
     ('c', 10.077784850173568), ('d', 9.88110965128071), 
     ('e', 9.979059974431932), ('f', 10.118695410115953)], 
     dtype=[('names', '|S1'), ('resvalues', '<f8')])

It seems numpy is a bit faster for this test (and the predefined groupby function is much slower):

In [32]: %timeit f_dict(listA, listB) 
10 loops, best of 3: 23 ms per loop 

In [33]: %timeit f_numpy(names, values) 
100 loops, best of 3: 9.78 ms per loop 

In [8]: %timeit rec_groupby(struct_array, ('names',), (('values', np.mean, 'values'),)) 
1 loops, best of 3: 203 ms per loop

2011-10-17 08:23:54 joris

So it sounds like numpy is worth it: if your script does this 150 times, the dictionary solution would cause a delay of about 2 seconds.

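One remark, though: in the timings I did not include converting the lists to numpy arrays. This may cancel out numpy's small time gain (I tested it with the example above, and f_numpy then takes almost the same time: 19.3 ms). So it probably depends on whether you have to convert the lists to numpy arrays every time. – joris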

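As far as my tests go, I did not see a big impact from the list -> array conversion, but I admit I did not do a thorough comparison between the two versions. – Einar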
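The question raised in these comments, whether the list-to-array conversion eats up the gain, can be checked directly. The following is only a rough sketch: `mean_by_name` and `time_both` are names made up here (the first mirrors the masking idea of f_numpy), and the numbers it prints will depend on the machine and the NumPy version.

```python
import random
import timeit

import numpy as np


def mean_by_name(names, values):
    """Per-name mean using one boolean mask per unique name."""
    unique_names = np.unique(names)
    means = np.array([values[names == name].mean() for name in unique_names])
    return unique_names, means


def time_both(n_items=20000, repeats=10):
    """Return seconds per call (arrays prepared in advance, conversion included)."""
    list_names = [random.choice("abcdef") for _ in range(n_items)]
    list_values = [20 * random.random() for _ in range(n_items)]
    arr_names = np.array(list_names)
    arr_values = np.array(list_values)

    # Arrays are built once, outside the timed call.
    t_arrays = timeit.timeit(
        lambda: mean_by_name(arr_names, arr_values), number=repeats
    ) / repeats

    # Lists are converted inside the timed call, as a script starting from lists would do.
    t_lists = timeit.timeit(
        lambda: mean_by_name(np.array(list_names), np.array(list_values)),
        number=repeats,
    ) / repeats
    return t_arrays, t_lists


t_arrays, t_lists = time_both()
print("arrays prepared in advance: %.2f ms" % (t_arrays * 1e3))
print("conversion included:        %.2f ms" % (t_lists * 1e3))
```

If the data can be kept in arrays from the start, only the first figure matters; if every call starts from Python lists, the second one is the fair comparison with the dictionary version.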

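Maybe a numpy solution is more elaborate than you need. Without doing anything fancy, I found the following to be "quick as a flash" (as in, there was no noticeable wait with 20,000 entries in the lists):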

import random 

listA = [random.choice("abcdef") for i in range(20000)] 
listB = [20 * random.random() for i in range(20000)] 

d = {} 

for a, b in zip(listA, listB): 
    d.setdefault(a, []).append(b) 

for key in d: 
    print(key, sum(d[key]) / len(d[key]))

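Your mileage may vary, depending on whether 20,000 is a typical length for your lists, and on whether you do this only a few times in a script or hundreds or thousands of times.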
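For the (name, value) tuple output mentioned in the question, the same dictionary idea can be wrapped in a function. This is a minimal sketch, not part of the original answer: `average_pairs` is a hypothetical name, and it keeps a running sum and count per name instead of storing every value.

```python
def average_pairs(names, values):
    """Return [(name, mean), ...] in order of first appearance of each name."""
    totals = {}  # name -> [running sum, count]
    order = []   # names in first-seen order, tracked explicitly
    for name, value in zip(names, values):
        if name not in totals:
            totals[name] = [0.0, 0]
            order.append(name)
        totals[name][0] += value
        totals[name][1] += 1
    return [(name, totals[name][0] / totals[name][1]) for name in order]


names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]
result = average_pairs(names, values)
print(result)
```

Keeping only a sum and a count per name avoids building one list per group, which may matter when the lists are long.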

2011-10-17 07:43:18

I should have mentioned it, and you are right: I do this about 150 times, with an average length of about 20K. – Einar

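A bit late to the party, but seeing that numpy still seems to lack this feature, here is my best attempt at a pure numpy solution for grouping by key. It should be much faster than the other proposed solutions for problem sets of appreciable size. The crux here is the nifty reduceat functionality.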

import numpy as np 

def group(key, value): 
    """ 
    group the values by key 
    returns the unique keys, their corresponding per-key sum, and the keycounts 
    """ 
    #upcast to numpy arrays 
    key = np.asarray(key) 
    value = np.asarray(value) 
    #first, sort by key 
    I = np.argsort(key) 
    key = key[I] 
    value = value[I] 
    #the slicing points of the bins to sum over 
    slices = np.concatenate(([0], np.where(key[:-1]!=key[1:])[0]+1)) 
    #first entry of each bin is a unique key 
    unique_keys = key[slices] 
    #sum over the slices specified by index 
    per_key_sum = np.add.reduceat(value, slices) 
    #number of counts per key is the difference of our slice points. cap off with number of keys for last bin 
    key_count = np.diff(np.append(slices, len(key))) 
    return unique_keys, per_key_sum, key_count 


names = ["a", "b", "b", "c", "d", "e", "e"] 
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01] 

unique_keys, per_key_sum, key_count = group(names, values) 
print(per_key_sum / key_count)
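To see what reduceat does with the slice points, it may help to trace the example data by hand. The sketch below assumes the keys are already sorted (as they are after the argsort step above); the variable names are chosen here for illustration.

```python
import numpy as np

# Keys already sorted, as they are after the argsort step in group().
keys = np.array(["a", "b", "b", "c", "d", "e", "e"])
vals = np.array([1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01])

# Start index of each run of equal keys: 0, plus every position
# whose key differs from the one before it.
starts = np.concatenate(([0], np.flatnonzero(keys[1:] != keys[:-1]) + 1))

# reduceat sums vals[starts[i]:starts[i + 1]] for each i,
# and vals[starts[-1]:] for the last run.
sums = np.add.reduceat(vals, starts)

# Run lengths: differences between consecutive start indices,
# closed off with the total number of keys.
counts = np.diff(np.append(starts, len(keys)))

means = sums / counts
pairs = list(zip(keys[starts].tolist(), means.tolist()))
print(starts)  # [0 1 3 4 5]
print(counts)  # [1 2 1 1 2]
print(pairs)
```

The means come out as approximately 1.2, 4.4, 2.0, 5.67 and 8.545, which matches the result asked for in the question.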

2013-12-05 11:55:05

A simple solution via numpy, assuming vA0 and vB0 are numpy arrays, with the values in vB0 grouped by the keys in vA0.

import numpy as np 

def avg_group(vA0, vB0):
    # unique keys in vA0, the index of their first occurrence, and how often each occurs
    vA, ind, counts = np.unique(vA0, return_index=True, return_counts=True)
    vB = vB0[ind]
    # for every key that occurs more than once, store the average
    # (or any other reduction) of all its original values
    for dup in vA[counts > 1]:
        vB[vA == dup] = np.average(vB0[vA0 == dup])
    return vA, vB
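A loop-free variant of the same idea is also possible. The sketch below is an alternative technique rather than the method of the answer above: it uses the `return_inverse` option of `np.unique` together with `np.bincount`, and `avg_group_bincount` is a name invented here.

```python
import numpy as np


def avg_group_bincount(keys, values):
    """Mean of values per unique key, without a Python-level loop."""
    keys = np.asarray(keys)
    values = np.asarray(values, dtype=float)
    # inverse[i] is the position of keys[i] within unique_keys
    unique_keys, inverse = np.unique(keys, return_inverse=True)
    sums = np.bincount(inverse, weights=values)  # per-key sum of values
    counts = np.bincount(inverse)                # per-key number of items
    return unique_keys, sums / counts


names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]
unique_names, means = avg_group_bincount(names, values)
print(list(zip(unique_names.tolist(), means.tolist())))
```

The unique keys come back sorted, so the original order of first appearance is not preserved.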

2015-09-15 14:44:11
