
Fastest way to compute entropy in Python

In my project I need to compute the entropy of 0-1 vectors many times. This is my code:

import numpy as np

def entropy(labels):
    """ Computes entropy of 0-1 vector. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    counts = np.bincount(labels)
    probs = counts[np.nonzero(counts)] / n_labels
    n_classes = len(probs)

    if n_classes <= 1:
        return 0
    return -np.sum(probs * np.log(probs)) / np.log(n_classes)

Is there a faster way?


What is the typical length of 'labels'? – unutbu 2013-03-16 14:09:08


The length isn't fixed. – blueSurfer 2013-03-16 14:12:44


It would help to benchmark with typical values of 'labels'. If 'labels' is very short, a pure Python implementation could actually be faster than using NumPy. – unutbu 2013-03-16 14:13:55

Answers


Following unutbu's advice, I created a pure Python implementation.

from math import log

import numpy as np

def entropy2(labels):
    """ Computes entropy of label distribution. """
    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    counts = np.bincount(labels)
    probs = counts / n_labels
    n_classes = np.count_nonzero(probs)

    if n_classes <= 1:
        return 0

    ent = 0.

    # Compute standard entropy (skip empty bins to avoid log(0)).
    for i in probs:
        if i > 0:
            ent -= i * log(i, n_classes)

    return ent

The point I was missing is that labels is a large array, but probs is only 3 or 4 elements long. With pure Python my application is now twice as fast.
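To check this effect on your own data, a minimal timing sketch (not part of the original answer) is to compare the vectorized reduction with a plain loop over a tiny probability vector; the exact numbers will depend on your machine and NumPy version.

import timeit

setup = """
import numpy as np
from math import log
probs = np.array([0.5, 0.3, 0.2])  # typical tiny probability vector
"""

# vectorized numpy reduction vs. a plain Python loop over 3 elements
t_numpy = timeit.timeit("-np.sum(probs * np.log(probs))", setup=setup, number=100000)
t_loop = timeit.timeit("-sum(p * log(p) for p in probs)", setup=setup, number=100000)
print(t_numpy, t_loop)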


Should 'base' really be set to the number of classes? I thought natural log was the standard (and it's what you used in your original question). – joemadeus 2016-06-17 21:22:36
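For illustration of the two conventions the comment is contrasting, here is a small sketch using scipy (not from the original answer): natural-log entropy in nats versus entropy with the base set to the number of classes, which normalizes the result into [0, 1].

import numpy as np
from scipy.stats import entropy

counts = np.array([8, 1, 1])    # 3 classes, heavily skewed
print(entropy(counts))          # natural log (nats), ~0.639
print(entropy(counts, base=3))  # base = number of classes, normalized to [0, 1], ~0.582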


The answers above are good, but if you need a version that can operate along different axes, here's a working implementation.

import numpy as np

def entropy(A, axis=None):
    """Computes the Shannon entropy of the elements of A. Assumes A is
    an array-like of nonnegative ints whose max value is approximately
    the number of unique values present.

    >>> a = [0, 1]
    >>> entropy(a)
    1.0
    >>> A = np.c_[a, a]
    >>> entropy(A)
    1.0
    >>> A     # doctest: +NORMALIZE_WHITESPACE
    array([[0, 0], [1, 1]])
    >>> entropy(A, axis=0) # doctest: +NORMALIZE_WHITESPACE
    array([ 1., 1.])
    >>> entropy(A, axis=1) # doctest: +NORMALIZE_WHITESPACE
    array([[ 0.], [ 0.]])
    >>> entropy([0, 0, 0])
    0.0
    >>> entropy([])
    0.0
    >>> entropy([5])
    0.0
    """
    if A is None or len(A) < 2:
        return 0.

    A = np.asarray(A)

    if axis is None:
        A = A.flatten()
        counts = np.bincount(A)  # needs small, non-negative ints
        counts = counts[counts > 0]
        if len(counts) == 1:
            return 0.  # avoid returning -0.0 to prevent weird doctests
        probs = counts / float(A.size)
        return -np.sum(probs * np.log2(probs))
    elif axis == 0:
        entropies = [entropy(col) for col in A.T]
        return np.array(entropies)
    elif axis == 1:
        entropies = [entropy(row) for row in A]
        return np.array(entropies).reshape((-1, 1))
    else:
        raise ValueError("unsupported axis: {}".format(axis))

Alternatively, an answer that doesn't rely on numpy:

import math
from collections import Counter

def eta(data, unit='natural'):
    base = {
        'shannon': 2.,
        'natural': math.exp(1),
        'hartley': 10.
    }

    if len(data) <= 1:
        return 0

    counts = Counter()

    for d in data:
        counts[d] += 1

    probs = [float(c) / len(data) for c in counts.values()]
    probs = [p for p in probs if p > 0.]

    ent = 0

    for p in probs:
        if p > 0.:
            ent -= p * math.log(p, base[unit])

    return ent

This will accept whatever data type you throw at it:

>>> eta(['mary', 'had', 'a', 'little', 'lamb']) 
1.6094379124341005 

>>> eta([c for c in "mary had a little lamb"]) 
2.311097886212714 
import pandas as pd
import scipy.stats as sc

# Input a pandas Series
def ent(data):
    p_data = data.value_counts() / len(data)  # calculates the probabilities
    entropy = sc.entropy(p_data)  # input probabilities to get the entropy
    return entropy

While this code may answer the question, providing additional context on why and/or how it answers the question would improve its long-term value. Code-only answers are discouraged. – Ajean 2016-08-29 23:14:52


Note that scipy.stats.entropy already normalizes the probabilities: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html. Alternatively, value_counts has a 'normalize' option: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html – Nzbuu 2017-12-21 14:30:17
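Putting those two remarks together, the ent function above can be written either way; a small sketch, not part of the original answer:

import pandas as pd
from scipy.stats import entropy

data = pd.Series([1, 3, 5, 2, 3, 5, 3, 2, 1, 3, 4, 5])

# scipy normalizes raw counts itself, so passing counts is enough
print(entropy(data.value_counts()))

# or let pandas produce the probabilities directly
print(entropy(data.value_counts(normalize=True)))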


My favorite function for entropy is the following:

import numpy as np

def entropy(labels):
    prob_dict = {x: labels.count(x) / len(labels) for x in labels}
    probs = np.array(list(prob_dict.values()))

    return -probs.dot(np.log2(probs))

I'm still looking for a nicer way to avoid the dict -> values -> list -> np.array conversion. I'll comment again if I find one.


Nice, though it would be better with collections.Counter. – NullPointer 2017-05-20 16:22:42
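One way to act on that suggestion and skip the dict -> values -> list -> np.array hop is to build the probability vector straight from a Counter; this is only a possible sketch (the helper name is made up), not the answerer's code.

import numpy as np
from collections import Counter

def entropy_counter(labels):  # hypothetical helper name
    # Counter yields one count per distinct label; np.fromiter builds
    # the float array without an intermediate list
    counts = np.fromiter(Counter(labels).values(), dtype=float)
    probs = counts / counts.sum()
    return -probs.dot(np.log2(probs))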


@Sanjeet Gupta's answer is good but could be condensed. This question specifically asks about the "fastest" way, but I only see timings in one answer, so I'll post a comparison of using scipy and numpy against the original poster's entropy2 answer, with slight alterations.

Four different approaches: scipy/numpy, numpy/math, pandas/numpy, and numpy.

import numpy as np
from scipy.stats import entropy
from math import log, e
import pandas as pd

import timeit

def entropy1(labels, base=None):
    value, counts = np.unique(labels, return_counts=True)
    return entropy(counts, base=base)

def entropy2(labels, base=None):
    """ Computes entropy of label distribution. """

    n_labels = len(labels)

    if n_labels <= 1:
        return 0

    value, counts = np.unique(labels, return_counts=True)
    probs = counts / n_labels
    n_classes = np.count_nonzero(probs)

    if n_classes <= 1:
        return 0

    ent = 0.

    # Compute entropy
    base = e if base is None else base
    for i in probs:
        ent -= i * log(i, base)

    return ent

def entropy3(labels, base=None):
    vc = pd.Series(labels).value_counts(normalize=True, sort=False)
    base = e if base is None else base
    return -(vc * np.log(vc) / np.log(base)).sum()

def entropy4(labels, base=None):
    value, counts = np.unique(labels, return_counts=True)
    norm_counts = counts / counts.sum()
    base = e if base is None else base
    return -(norm_counts * np.log(norm_counts) / np.log(base)).sum()

Timeit operations:

repeat_number = 1000000 

a = timeit.repeat(stmt='''entropy1(labels)''', 
        setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy1''', 
        repeat=3, number=repeat_number) 

b = timeit.repeat(stmt='''entropy2(labels)''', 
        setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy2''', 
        repeat=3, number=repeat_number) 

c = timeit.repeat(stmt='''entropy3(labels)''', 
        setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy3''', 
        repeat=3, number=repeat_number) 

d = timeit.repeat(stmt='''entropy4(labels)''', 
        setup='''labels=[1,3,5,2,3,5,3,2,1,3,4,5];from __main__ import entropy4''', 
        repeat=3, number=repeat_number) 

Timeit results:

# for loop to print out results of timeit 
for approach,timeit_results in zip(['scipy/numpy', 'numpy/math', 'pandas/numpy', 'numpy'], [a,b,c,d]): 
    print('Method: {}, Avg.: {:.6f}'.format(approach, np.array(timeit_results).mean())) 

Method: scipy/numpy, Avg.: 63.315312 
Method: numpy/math, Avg.: 49.256894 
Method: pandas/numpy, Avg.: 884.644023 
Method: numpy, Avg.: 60.026938 

Winner: numpy/math (entropy2)

It's also worth noting that the entropy2 function above can handle numeric and text data, e.g. entropy2(list('abcdefabacdebcab')). The original poster's answer is from 2013 and had a specific use case with binned ints, but it won't work for text.
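The underlying reason is that np.unique accepts arbitrary sortable values, while np.bincount in the accepted answer only accepts non-negative integers; a quick sketch, not from the original answer:

import numpy as np

text = list('abcdefabacdebcab')
values, counts = np.unique(text, return_counts=True)  # works fine for strings
print(values, counts)

# np.bincount(text) would raise a TypeError here, because bincount
# requires an array of non-negative ints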


Uniformly distributed data (high entropy):

s = range(0, 256)

Shannon entropy calculation, step by step:

import collections
import math

# calculate probability for each byte as number of occurrences/array length 
probabilities = [n_x/len(s) for x,n_x in collections.Counter(s).items()] 
# [0.00390625, 0.00390625, 0.00390625, ...] 

# calculate per-character entropy fractions 
e_x = [-p_x*math.log(p_x,2) for p_x in probabilities] 
# [0.03125, 0.03125, 0.03125, ...] 

# sum fractions to obtain Shannon entropy 
entropy = sum(e_x) 
>>> entropy 
8.0 

A one-liner (assuming import collections and import math):

def H(s): return sum([-p_x*math.log(p_x,2) for p_x in [n_x/len(s) for x,n_x in collections.Counter(s).items()]]) 

A proper function:

import collections
import math

def H(s):
    probabilities = [n_x / len(s) for x, n_x in collections.Counter(s).items()]
    e_x = [-p_x * math.log(p_x, 2) for p_x in probabilities]
    return sum(e_x)

Test cases (the English text is taken from the CyberChef entropy estimator):

>>> H(range(0,256)) 
8.0 
>>> H(range(0,64)) 
6.0 
>>> H(range(0,128)) 
7.0 
>>> H([0,1]) 
1.0 
>>> H('Standard English text usually falls somewhere between 3.5 and 5') 
4.228788210509104