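@Sanjeet Gupta's answer is good but could be condensed. This question specifically asks about the "fastest" way, but I only see timings on one answer, so I'll post a comparison of using scipy and numpy against the original poster's entropy2 answer, with slight alterations.
Four different approaches: scipy/numpy, numpy/math, pandas/numpy, numpy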
import numpy as np
from scipy.stats import entropy
from math import log, e
import pandas as pd
import timeit

def entropy1(labels, base=None):
    value, counts = np.unique(labels, return_counts=True)
    return entropy(counts, base=base)

def entropy2(labels, base=None):
    """ Computes entropy of label distribution. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    value, counts = np.unique(labels, return_counts=True)
    probs = counts / n_labels
    n_classes = np.count_nonzero(probs)

    if n_classes <= 1:
        return 0

    ent = 0.

    # Compute entropy
    base = e if base is None else base
    for i in probs:
        ent -= i * log(i, base)

    return ent

def entropy3(labels, base=None):
    vc = pd.Series(labels).value_counts(normalize=True, sort=False)
    base = e if base is None else base
    return -(vc * np.log(vc) / np.log(base)).sum()

def entropy4(labels, base=None):
    value, counts = np.unique(labels, return_counts=True)
    norm_counts = counts / counts.sum()
    base = e if base is None else base
    return -(norm_counts * np.log(norm_counts) / np.log(base)).sum()
Timeit operations:
repeat_number = 1000000
a = timeit.repeat(stmt='''entropy1(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy1''',
                  repeat=3, number=repeat_number)

b = timeit.repeat(stmt='''entropy2(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy2''',
                  repeat=3, number=repeat_number)

c = timeit.repeat(stmt='''entropy3(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy3''',
                  repeat=3, number=repeat_number)

d = timeit.repeat(stmt='''entropy4(labels)''',
                  setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy4''',
                  repeat=3, number=repeat_number)
Timeit results:
# for loop to print out results of timeit
for approach, timeit_results in zip(['scipy/numpy', 'numpy/math', 'pandas/numpy', 'numpy'], [a, b, c, d]):
    print('Method: {}, Avg.: {:.6f}'.format(approach, np.array(timeit_results).mean()))
Method: scipy/numpy, Avg.: 63.315312
Method: numpy/math, Avg.: 49.256894
Method: pandas/numpy, Avg.: 884.644023
Method: numpy, Avg.: 60.026938
Winner: numpy/math (entropy2)
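The averages above are total seconds for 1,000,000 calls each, so they may be easier to compare as per-call times. A minimal sketch of that conversion, using only the totals already printed above (the names `avg_totals`, `per_call_us` and `fastest` are mine, for illustration only):

```python
# Totals copied from the timeit output above: seconds per 1,000,000 calls.
repeat_number = 1_000_000
avg_totals = {
    'scipy/numpy': 63.315312,
    'numpy/math': 49.256894,
    'pandas/numpy': 884.644023,
    'numpy': 60.026938,
}

# Convert each total to microseconds per single call.
per_call_us = {name: total / repeat_number * 1e6
               for name, total in avg_totals.items()}

# Pick the approach with the smallest per-call time.
fastest = min(per_call_us, key=per_call_us.get)

for name, us in per_call_us.items():
    print('{}: {:.1f} us per call'.format(name, us))
print('fastest:', fastest)
```

Note that the timeit documentation suggests looking at the minimum of the repeats rather than the mean, since higher values tend to reflect interference from other processes; I only have the means here, so take the ranking as indicative.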
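It's also worth noting that the entropy2 function above can handle numeric AND text data, e.g. entropy2(list('abcdefabacdebcab')). The original poster's answer is from 2013 and had a specific use case of binning ints, but it won't work for text.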
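The reason text works is that np.unique counts any sortable labels, not just ints. A small sketch of this, assuming a hypothetical helper `label_entropy` that mirrors the numpy-only approach (it is not one of the four timed functions):

```python
import numpy as np

def label_entropy(labels, base=None):
    """Shannon entropy of a label sequence (hypothetical helper, numpy only)."""
    # np.unique works on strings as well as numbers.
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    ent = -(probs * np.log(probs)).sum()
    # Natural log by default; divide by log(base) to change units.
    return ent if base is None else ent / np.log(base)

text_entropy = label_entropy(list('abcdefabacdebcab'))
int_entropy = label_entropy([1, 3, 5, 2, 3, 5, 3, 2, 1, 3, 4, 5])

# Two equally likely labels carry one bit of entropy.
uniform_bits = label_entropy(list('aabb'), base=2)
```

Here `uniform_bits` should come out at (approximately) 1.0, which is a quick way to check that the base handling is right.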
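What is the typical length of 'labels'? – unutbu 2013-03-16 14:09:08
The length is not fixed. – blueSurfer 2013-03-16 14:12:44
It would help benchmarking to know typical values of 'labels'. If 'labels' is too short, a pure Python implementation might actually be faster than using NumPy. – unutbu 2013-03-16 14:13:55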
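For reference, a pure Python version of the kind the last comment has in mind could look roughly like the sketch below. It uses only the standard library; the name `entropy_pure` is hypothetical, and I have not timed it, so whether it actually beats the NumPy versions on short inputs would need to be checked with the same timeit setup.

```python
from collections import Counter
from math import log

def entropy_pure(labels, base=None):
    """Shannon entropy using only the standard library (hypothetical name)."""
    n_labels = len(labels)
    if n_labels <= 1:
        return 0.0

    # Counter accepts any hashable labels: ints, strings, tuples, ...
    counts = Counter(labels)

    ent = 0.0
    for count in counts.values():
        p = count / n_labels
        ent -= p * log(p)

    # Natural log by default; divide by log(base) to change units.
    return ent if base is None else ent / log(base)

labels = [1, 3, 5, 2, 3, 5, 3, 2, 1, 3, 4, 5]
result = entropy_pure(labels)
```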