2014-01-08 52 views
3

我試圖比較sklearn.neighbors.KernelDensityscipy.stats.gaussian_kde之間二維數組的性能。sklearn中的2D KDE帶寬與scipy中的帶寬之間的關係

this article我看到每個函數中的帶寬(bw)都有不同的處理方式。本文給出了一個在scipy中設置正確bw的配方,因此它將與sklearn中使用的bw等效。基本上它將bw除以樣本標準偏差。結果是這樣的:

# For sklearn 
bw = 0.15 

# For scipy 
bw = 0.15/x.std(ddof=1) 

其中x是我使用以獲得KDE採樣陣列。這在1D中工作得很好,但我無法使它在2D中工作。

這裏是什麼,我得到了MWE

import numpy as np 
from scipy import stats 
from sklearn.neighbors import KernelDensity 

# Generate random data. 
n = 1000 
m1, m2 = np.random.normal(0.2, 0.2, size=n), np.random.normal(0.2, 0.2, size=n) 
# Define limits. 
xmin, xmax = min(m1), max(m1) 
ymin, ymax = min(m2), max(m2) 
# Format data. 
x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j] 
positions = np.vstack([x.ravel(), y.ravel()]) 
values = np.vstack([m1, m2]) 

# Define some point to evaluate the KDEs. 
x1, y1 = 0.5, 0.5 

# ------------------------------------------------------- 
# Perform a kernel density estimate on the data using scipy. 
kernel = stats.gaussian_kde(values, bw_method=0.15/np.asarray(values).std(ddof=1)) 
# Get KDE value for the point. 
iso1 = kernel((x1,y1)) 
print 'iso1 = ', iso[0] 

# ------------------------------------------------------- 
# Perform a kernel density estimate on the data using sklearn. 
kernel_sk = KernelDensity(kernel='gaussian', bandwidth=0.15).fit(zip(*values)) 
# Get KDE value for the point. 
iso2 = kernel_sk.score_samples([[x1, y1]]) 
print 'iso2 = ', np.exp(iso2[0]) 

iso2是作爲指數自sklearn返回日誌值)

結果我得到了iso1iso2是不同的,我我失去了如何影響帶寬(在任一函數中)使它們相等(如他們應該那樣)。


添加

我在sklearn聊天(由EP),我應該爲了獲得與sklearn比較的結果計算與scipy內核之前縮放(x,y)值建議過。

所以這是我做過什麼:

# Scale values. 
x_val_sca = np.asarray(values[0])/np.asarray(values).std(axis=1)[0] 
y_val_sca = np.asarray(values[1])/np.asarray(values).std(axis=1)[1] 
values = [x_val_sca, y_val_sca] 
kernel = stats.gaussian_kde(values, bw_method=bw_value) 

即:我得到內核scipy,同時保留其取得sklearn不變的內核前行同時縮放尺寸。

這給了更好的結果,但仍有獲得了籽粒差異:

kernels

其中的紅點是在代碼中(x1,y1)點。可以看出,密度估計的形狀仍然存在差異,雖然很小。也許這是最好的可以實現的?

回答

3

幾年後,我嘗試了這一點,並認爲我的工作不需要數據重新縮放。帶寬值確實需要一些縮放比例:

# For sklearn 
bw = 0.15 

# For scipy 
bw = 0.15/x.std(ddof=1) 

對於相同點的兩個KDE的評估並不完全相同。例如,以下是爲(x1, y1)點的評價:

iso1 = 0.00984751705005 # Scipy 
iso2 = 0.00989788224787 # Sklearn 

,但我想這是足夠接近。

下面是2D情況和它的輸出,就我所看到的,看起來幾乎一模一樣的MWE:

enter image description here

import numpy as np 
from scipy import stats 
from sklearn.neighbors import KernelDensity 
import matplotlib.pyplot as plt 
import matplotlib.gridspec as gridspec 

# Generate random data. 
n = 1000 
m1, m2 = np.random.normal(-3., 3., size=n), np.random.normal(-3., 3., size=n) 
# Define limits. 
xmin, xmax = min(m1), max(m1) 
ymin, ymax = min(m2), max(m2) 
ext_range = [xmin, xmax, ymin, ymax] 
# Format data. 
x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j] 
positions = np.vstack([x.ravel(), y.ravel()]) 
values = np.vstack([m1, m2]) 

# Define some point to evaluate the KDEs. 
x1, y1 = 0.5, 0.5 
# Bandwidth value. 
bw = 0.15 

# ------------------------------------------------------- 
# Perform a kernel density estimate on the data using scipy. 
# **Bandwidth needs to be scaled to match Sklearn results** 
kernel = stats.gaussian_kde(
    values, bw_method=bw/np.asarray(values).std(ddof=1)) 
# Get KDE value for the point. 
iso1 = kernel((x1, y1)) 
print 'iso1 = ', iso1[0] 

# ------------------------------------------------------- 
# Perform a kernel density estimate on the data using sklearn. 
kernel_sk = KernelDensity(kernel='gaussian', bandwidth=bw).fit(zip(*values)) 
# Get KDE value for the point. Use exponential since sklearn returns the 
# log values 
iso2 = np.exp(kernel_sk.score_samples([[x1, y1]])) 
print 'iso2 = ', iso2[0] 


# Plot 
fig = plt.figure(figsize=(10, 10)) 
gs = gridspec.GridSpec(1, 2) 

# Scipy 
plt.subplot(gs[0]) 
plt.title("Scipy", x=0.5, y=0.92, fontsize=10) 
# Evaluate kernel in grid positions. 
k_pos = kernel(positions) 
kde = np.reshape(k_pos.T, x.shape) 
plt.imshow(np.rot90(kde), cmap=plt.cm.YlOrBr, extent=ext_range) 
plt.contour(x, y, kde, 5, colors='k', linewidths=0.6) 

# Sklearn 
plt.subplot(gs[1]) 
plt.title("Sklearn", x=0.5, y=0.92, fontsize=10) 
# Evaluate kernel in grid positions. 
k_pos2 = np.exp(kernel_sk.score_samples(zip(*positions))) 
kde2 = np.reshape(k_pos2.T, x.shape) 
plt.imshow(np.rot90(kde2), cmap=plt.cm.YlOrBr, extent=ext_range) 
plt.contour(x, y, kde2, 5, colors='k', linewidths=0.6) 

fig.tight_layout() 
plt.savefig('KDEs', dpi=300, bbox_inches='tight')