
Fastest way to compute entropy in Python

In my project I need to compute the entropy of 0-1 vectors many times. This is my code:

import numpy as np

def entropy(labels):
    """ Computes entropy of 0-1 vector. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    counts = np.bincount(labels)
    probs = counts[np.nonzero(counts)] / n_labels
    n_classes = len(probs)

    if n_classes <= 1:
        return 0
    return -np.sum(probs * np.log(probs)) / np.log(n_classes)

Is there a faster way?


What is the typical length of 'labels'? – unutbu 2013-03-16 14:09:08


The length isn't fixed. – blueSurfer 2013-03-16 14:12:44


It would help to benchmark with typical values of 'labels'. If 'labels' is very short, a pure Python implementation could actually be faster than using NumPy. – unutbu 2013-03-16 14:13:55

Answers


Following unutbu's advice, I created a pure Python implementation.

from math import log

import numpy as np

def entropy2(labels):
    """ Computes entropy of label distribution. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    counts = np.bincount(labels)
    probs = counts / n_labels
    n_classes = np.count_nonzero(probs)

    if n_classes <= 1:
        return 0

    ent = 0.

    # Compute standard entropy (skip empty bins to avoid log(0)).
    for i in probs:
        if i > 0:
            ent -= i * log(i, n_classes)

    return ent

The point I was missing is that labels is a large array, but probs is only 3 or 4 elements long. With pure Python my application is now twice as fast.
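To check this effect on your own data, a minimal timing sketch (not part of the original answer) is to compare the vectorized reduction with a plain loop over a tiny probability vector; the exact numbers will depend on your machine and NumPy version.

import timeit

setup = """
import numpy as np
from math import log
probs = np.array([0.5, 0.3, 0.2])  # typical tiny probability vector
"""

# vectorized numpy reduction vs. a plain Python loop over 3 elements
t_numpy = timeit.timeit("-np.sum(probs * np.log(probs))", setup=setup, number=100000)
t_loop = timeit.timeit("-sum(p * log(p) for p in probs)", setup=setup, number=100000)
print(t_numpy, t_loop)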


Should 'base' really be set to the number of classes? I thought natural log was the standard (and it's what you used in your original question). – joemadeus 2016-06-17 21:22:36
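For illustration of the two conventions the comment is contrasting, here is a small sketch using scipy (not from the original answer): natural-log entropy in nats versus entropy with the base set to the number of classes, which normalizes the result into [0, 1].

import numpy as np
from scipy.stats import entropy

counts = np.array([8, 1, 1])    # 3 classes, heavily skewed
print(entropy(counts))          # natural log (nats), ~0.639
print(entropy(counts, base=3))  # base = number of classes, normalized to [0, 1], ~0.582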


The answers above are good, but if you need a version that can operate along different axes, here's a working implementation.

import numpy as np

def entropy(A, axis=None):
    """Computes the Shannon entropy of the elements of A. Assumes A is
    an array-like of nonnegative ints whose max value is approximately
    the number of unique values present.

    >>> a = [0, 1]
    >>> entropy(a)
    1.0
    >>> A = np.c_[a, a]
    >>> entropy(A)
    1.0
    >>> A     # doctest: +NORMALIZE_WHITESPACE
    array([[0, 0], [1, 1]])
    >>> entropy(A, axis=0) # doctest: +NORMALIZE_WHITESPACE
    array([ 1., 1.])
    >>> entropy(A, axis=1) # doctest: +NORMALIZE_WHITESPACE
    array([[ 0.], [ 0.]])
    >>> entropy([0, 0, 0])
    0.0
    >>> entropy([])
    0.0
    >>> entropy([5])
    0.0
    """
    if A is None or len(A) < 2:
        return 0.

    A = np.asarray(A)

    if axis is None:
        A = A.flatten()
        counts = np.bincount(A)  # needs small, non-negative ints
        counts = counts[counts > 0]
        if len(counts) == 1:
            return 0.  # avoid returning -0.0 to prevent weird doctests
        probs = counts / float(A.size)
        return -np.sum(probs * np.log2(probs))
    elif axis == 0:
        entropies = [entropy(col) for col in A.T]
        return np.array(entropies)
    elif axis == 1:
        entropies = [entropy(row) for row in A]
        return np.array(entropies).reshape((-1, 1))
    else:
        raise ValueError("unsupported axis: {}".format(axis))

Alternatively, an answer that doesn't rely on numpy:

import math
from collections import Counter

def eta(data, unit='natural'):
    base = {
        'shannon': 2.,
        'natural': math.exp(1),
        'hartley': 10.
    }

    if len(data) <= 1:
        return 0

    counts = Counter()

    for d in data:
        counts[d] += 1

    probs = [float(c) / len(data) for c in counts.values()]
    probs = [p for p in probs if p > 0.]

    ent = 0

    for p in probs:
        if p > 0.:
            ent -= p * math.log(p, base[unit])

    return ent

This will accept whatever data type you throw at it:

>>> eta(['mary', 'had', 'a', 'little', 'lamb']) 
1.6094379124341005 

>>> eta([c for c in "mary had a little lamb"]) 
2.311097886212714 
import pandas as pd
import scipy.stats as sc

# Input a pandas Series
def ent(data):
    p_data = data.value_counts() / len(data)  # calculates the probabilities
    entropy = sc.entropy(p_data)  # input probabilities to get the entropy
    return entropy

While this code may answer the question, providing additional context on why and/or how it answers the question would improve its long-term value. Code-only answers are discouraged. – Ajean 2016-08-29 23:14:52


Note that scipy.stats.entropy already normalizes the probabilities: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html. Alternatively, value_counts has a 'normalize' option: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html – Nzbuu 2017-12-21 14:30:17
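Putting those two remarks together, the ent function above can be written either way; a small sketch, not part of the original answer:

import pandas as pd
from scipy.stats import entropy

data = pd.Series([1, 3, 5, 2, 3, 5, 3, 2, 1, 3, 4, 5])

# scipy normalizes raw counts itself, so passing counts is enough
print(entropy(data.value_counts()))

# or let pandas produce the probabilities directly
print(entropy(data.value_counts(normalize=True)))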


My favorite function for entropy is the following:

import numpy as np

def entropy(labels):
    prob_dict = {x: labels.count(x) / len(labels) for x in labels}
    probs = np.array(list(prob_dict.values()))

    return -probs.dot(np.log2(probs))

I'm still looking for a nicer way to avoid the dict -> values -> list -> np.array conversion. I'll comment again if I find one.


Nice, though it would be better with collections.Counter. – NullPointer 2017-05-20 16:22:42
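One way to act on that suggestion and skip the dict -> values -> list -> np.array hop is to build the probability vector straight from a Counter; this is only a possible sketch (the helper name is made up), not the answerer's code.

import numpy as np
from collections import Counter

def entropy_counter(labels):  # hypothetical helper name
    # Counter yields one count per distinct label; np.fromiter builds
    # the float array without an intermediate list
    counts = np.fromiter(Counter(labels).values(), dtype=float)
    probs = counts / counts.sum()
    return -probs.dot(np.log2(probs))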


@Sanjeet Gupta's answer is good but could be condensed. This question specifically asks about the "fastest" way, but I only see timings in one answer, so I'll post a comparison of using scipy and numpy against the original poster's entropy2 answer, with slight alterations.

Four different approaches: scipy/numpy, numpy/math, pandas/numpy, and numpy.

import numpy as np
from scipy.stats import entropy
from math import log, e
import pandas as pd

import timeit

def entropy1(labels, base=None):
    value, counts = np.unique(labels, return_counts=True)
    return entropy(counts, base=base)

def entropy2(labels, base=None):
    """ Computes entropy of label distribution. """

    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    value, counts = np.unique(labels, return_counts=True)
    probs = counts / n_labels
    n_classes = np.count_nonzero(probs)

    if n_classes <= 1:
        return 0

    ent = 0.

    # Compute entropy
    base = e if base is None else base
    for i in probs:
        ent -= i * log(i, base)

    return ent

def entropy3(labels, base=None):
    vc = pd.Series(labels).value_counts(normalize=True, sort=False)
    base = e if base is None else base
    return -(vc * np.log(vc) / np.log(base)).sum()

def entropy4(labels, base=None):
    value, counts = np.unique(labels, return_counts=True)
    norm_counts = counts / counts.sum()
    base = e if base is None else base
    return -(norm_counts * np.log(norm_counts) / np.log(base)).sum()

Timeit operations:

repeat_number = 1000000 

a = timeit.repeat(stmt='''entropy1(labels)''', 
        setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy1''', 
        repeat=3, number=repeat_number) 

b = timeit.repeat(stmt='''entropy2(labels)''', 
        setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy2''', 
        repeat=3, number=repeat_number) 

c = timeit.repeat(stmt='''entropy3(labels)''', 
        setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy3''', 
        repeat=3, number=repeat_number) 

d = timeit.repeat(stmt='''entropy4(labels)''', 
        setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy4''', 
        repeat=3, number=repeat_number) 

Timeit results:

# for loop to print out results of timeit 
for approach,timeit_results in zip(['scipy/numpy', 'numpy/math', 'pandas/numpy', 'numpy'], [a,b,c,d]): 
    print('Method: {}, Avg.: {:.6f}'.format(approach, np.array(timeit_results).mean())) 

Method: scipy/numpy, Avg.: 63.315312 
Method: numpy/math, Avg.: 49.256894 
Method: pandas/numpy, Avg.: 884.644023 
Method: numpy, Avg.: 60.026938 

Winner: numpy/math (entropy2)

It's also worth noting that the entropy2 function above can handle numeric and text data, e.g. entropy2(list('abcdefabacdebcab')). The original poster's answer is from 2013 and had a specific use case with binned ints, but it won't work for text.
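The underlying reason is that np.unique accepts arbitrary sortable values, while np.bincount in the accepted answer only accepts non-negative integers; a quick sketch, not from the original answer:

import numpy as np

text = list('abcdefabacdebcab')
values, counts = np.unique(text, return_counts=True)  # works fine for strings
print(values, counts)

# np.bincount(text) would raise a TypeError here, because bincount
# requires an array of non-negative ints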


Uniformly distributed data (high entropy):

s = range(0, 256)

Shannon entropy calculation, step by step:

import collections
import math

# calculate probability for each byte as number of occurrences/array length 
probabilities = [n_x/len(s) for x,n_x in collections.Counter(s).items()] 
# [0.00390625, 0.00390625, 0.00390625, ...] 

# calculate per-character entropy fractions 
e_x = [-p_x*math.log(p_x,2) for p_x in probabilities] 
# [0.03125, 0.03125, 0.03125, ...] 

# sum fractions to obtain Shannon entropy 
entropy = sum(e_x) 
>>> entropy 
8.0 

A one-liner (assuming import collections and import math):

def H(s): return sum([-p_x*math.log(p_x,2) for p_x in [n_x/len(s) for x,n_x in collections.Counter(s).items()]]) 

A proper function:

import collections
import math

def H(s):
    probabilities = [n_x / len(s) for x, n_x in collections.Counter(s).items()]
    e_x = [-p_x * math.log(p_x, 2) for p_x in probabilities]
    return sum(e_x)

Test cases (the English text is taken from the CyberChef entropy estimator):

>>> H(range(0,256)) 
8.0 
>>> H(range(0,64)) 
6.0 
>>> H(range(0,128)) 
7.0 
>>> H([0,1]) 
1.0 
>>> H('Standard English text usually falls somewhere between 3.5 and 5') 
4.228788210509104