2012-05-03 77 views
10

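Efficiently create a density plot for high-density regions, points for sparse regions

I need to make a plot that works like a density plot in the high-density regions, but uses individual points wherever the density falls below some threshold. I couldn't find any existing code resembling what I need in the matplotlib thumbnail gallery or through Google searches. I have working code that I wrote myself, but it is somewhat tricky and (more importantly) takes an unacceptably long time when the number of points/bins is large. Here is the code: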

import numpy as np 
import math 
import matplotlib as mpl 
import matplotlib.pyplot as plt 
import pylab 
import numpy.random 

#Create the colormap: 
halfpurples = {'blue': [(0.0,1.0,1.0),(0.000001, 0.78431373834609985, 0.78431373834609985), 
(0.25, 0.729411780834198, 0.729411780834198), (0.5, 
0.63921570777893066, 0.63921570777893066), (0.75, 
0.56078433990478516, 0.56078433990478516), (1.0, 0.49019607901573181, 
0.49019607901573181)], 

    'green': [(0.0,1.0,1.0),(0.000001, 
    0.60392159223556519, 0.60392159223556519), (0.25, 
    0.49019607901573181, 0.49019607901573181), (0.5, 
    0.31764706969261169, 0.31764706969261169), (0.75, 
    0.15294118225574493, 0.15294118225574493), (1.0, 0.0, 0.0)], 

    'red': [(0.0,1.0,1.0),(0.000001, 
    0.61960786581039429, 0.61960786581039429), (0.25, 
    0.50196081399917603, 0.50196081399917603), (0.5, 
    0.41568627953529358, 0.41568627953529358), (0.75, 
    0.32941177487373352, 0.32941177487373352), (1.0, 
    0.24705882370471954, 0.24705882370471954)]} 

halfpurplecmap = mpl.colors.LinearSegmentedColormap('halfpurples',halfpurples,256) 

#Create x,y arrays of normally distributed points 
npts = 1000 
x = numpy.random.standard_normal(npts) 
y = numpy.random.standard_normal(npts) 

#Set bin numbers in both axes 
nxbins = 25 
nybins = 25 

#Set the cutoff for resolving the individual points 
minperbin = 1 

#Make the density histogram 
H, yedges, xedges = np.histogram2d(y,x,bins=(nybins,nxbins)) 
#Reorient the axes 
H = H[::-1] 

extent = [xedges[0],xedges[-1],yedges[0],yedges[-1]] 

#Compute all bins where the density plot value is below (or equal to) the threshold 
lowxleftedges = [[xedges[i] for j in range(len(H[:,i])) if H[j,i] <= minperbin] for i in range(len(H[0,:]))] 
lowxrightedges = [[xedges[i+1] for j in range(len(H[:,i])) if H[j,i] <= minperbin] for i in range(len(H[0,:]))] 
lowyleftedges = [[yedges[-(j+2)] for j in range(len(H[:,i])) if H[j,i] <= minperbin] for i in range(len(H[0,:]))] 
lowyrightedges = [[yedges[-(j+1)] for j in range(len(H[:,i])) if H[j,i] <= minperbin] for i in range(len(H[0,:]))] 

#Flatten and convert to numpy array 
lowxleftedges = np.asarray([item for sublist in lowxleftedges for item in sublist]) 
lowxrightedges = np.asarray([item for sublist in lowxrightedges for item in sublist]) 
lowyleftedges = np.asarray([item for sublist in lowyleftedges for item in sublist]) 
lowyrightedges = np.asarray([item for sublist in lowyrightedges for item in sublist]) 

#Find all points that lie in these regions 
lowdatax = [[x[i] for j in range(len(lowxleftedges)) if lowxleftedges[j] <= x[i] and x[i] <= lowxrightedges[j] and lowyleftedges[j] <= y[i] and y[i] <= lowyrightedges[j]] for i in range(len(x))] 
lowdatay = [[y[i] for j in range(len(lowyleftedges)) if lowxleftedges[j] <= x[i] and x[i] <= lowxrightedges[j] and lowyleftedges[j] <= y[i] and y[i] <= lowyrightedges[j]] for i in range(len(y))] 

#Flatten and convert into numpy array 
lowdatax = np.asarray([item for sublist in lowdatax for item in sublist]) 
lowdatay = np.asarray([item for sublist in lowdatay for item in sublist]) 

#Plot 
fig1 = plt.figure() 
ax1 = fig1.add_subplot(111) 
ax1.plot(lowdatax,lowdatay,linestyle='none',marker='o',mfc='k',mec='k') 
cp1 = ax1.imshow(H,interpolation='nearest',extent=extent,cmap=halfpurplecmap,vmin=minperbin) 
fig1.colorbar(cp1) 

fig1.savefig('contourtest.eps') 

This code produces an image that looks like this:

[image: contour test]

However, on larger data sets the program takes anywhere from several seconds to several minutes. Any ideas on how to speed this up? Thanks!

+0

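The other day my girlfriend showed me that R has the 'smoothScatter' function (http://rfunction.com/archives/595), which neatly combines a scatter plot and a density map into one good-looking plot. I was immediately frustrated that matplotlib has no equivalent, so I was glad to find this old question here. – Julien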

Answers

13

This should do it:

import matplotlib.pyplot as plt, numpy as np 

#histogram definition 
xyrange = [[-5,5],[-5,5]] # data range 
bins = [100,100] # number of bins 
thresh = 3 #density threshold 

#data definition 
N = 100000 # must be an integer to be used as an array size 
xdat, ydat = np.random.normal(size=N), np.random.normal(1, 0.6, size=N) 

# histogram the data 
# (np.histogram2d is used here; the scipy.histogram2d alias called originally is not available in recent SciPy) 
hh, locx, locy = np.histogram2d(xdat, ydat, range=xyrange, bins=bins) 
posx = np.digitize(xdat, locx) 
posy = np.digitize(ydat, locy) 

#select points within the histogram 
ind = (posx > 0) & (posx <= bins[0]) & (posy > 0) & (posy <= bins[1]) 
hhsub = hh[posx[ind] - 1, posy[ind] - 1] # values of the histogram where the points are 
xdat1 = xdat[ind][hhsub < thresh] # low density points 
ydat1 = ydat[ind][hhsub < thresh] 
hh[hh < thresh] = np.nan # fill the areas with low density by NaNs 

plt.imshow(np.flipud(hh.T),cmap='jet',extent=np.array(xyrange).flatten(), interpolation='none', origin='upper') 
plt.colorbar() 
plt.plot(xdat1, ydat1, '.',color='darkblue') 
plt.show() 

image

+0

Nice, this is the same idea as my eventual solution, but expressed in fewer lines of code. Thanks! – Singularity

+0

Is there a way to do the same thing, but with dynamic rescaling of the plot? For example, if the standard deviations are very different, – chiffa

+0

'np.histogram2d' works too, so there is no need to import 'scipy' – Mathias711

2

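Your problem is quadratic: for npts = 1000 you end up with arrays of about 10^6 elements, and you then iterate over these lists with list comprehensions.
This is a matter of taste, of course, but I find that list comprehensions can produce code that is very hard to follow, and they are sometimes only slightly faster than plain loops. That is not my point, though.
My point is that for operations on large arrays NumPy gives you functions such as: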

np.where, np.choose etc. 

See whether you can implement what your list comprehensions do with NumPy functions instead; your code should then run faster.
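
As a rough illustration of that suggestion, the sketch below replaces a Python-level list comprehension with a boolean mask and `np.where`. The names `values` and `threshold` are made up for this example and do not come from the question's code.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.standard_normal(10_000)   # hypothetical data
threshold = 1.0                        # hypothetical cutoff

# List comprehension: one Python-level comparison per element
slow = np.asarray([v for v in values if v <= threshold])

# Boolean mask: the comparison runs inside NumPy
fast = values[values <= threshold]

# np.where returns the indices where the condition holds
idx = np.where(values <= threshold)[0]
```

Both approaches select the same elements; the mask version simply avoids the per-element Python overhead.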

Do I understand this correctly? Your comment:

#Find all points that lie in these regions 

Are you testing whether points lie inside a polygon? If so, consider using matplotlib's point-in-polygon test.
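
One way to run such a test is `matplotlib.path.Path.contains_points`, which checks many points in a single call. The sketch below is only an illustration: the square polygon and the test points are hypothetical and are not taken from the question.

```python
import numpy as np
from matplotlib.path import Path

# Hypothetical polygon: the unit square, vertices listed in order
square = Path([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)])

# Hypothetical test points, one (x, y) pair per row
pts = np.array([[0.5, 0.5], [1.5, 0.5], [0.25, 0.75], [-0.1, 0.2]])

# contains_points tests every row at once and returns a boolean array
inside = square.contains_points(pts)
pts_inside = pts[inside]
```

For axis-aligned histogram bins, computing a bin index per point is likely simpler than a general polygon test.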

1

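After a night's sleep and reading through Oz123's suggestions, I figured it out. The trick is to compute which bin (xi,yi) each x,y point falls into, and then test whether H[xi,yi] (actually H[yi,xi] in my case) is below the threshold. The code is below; it runs very quickly for large numbers of points and is much cleaner: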

import numpy as np 
import math 
import matplotlib as mpl 
import matplotlib.pyplot as plt 
import pylab 
import numpy.random 

#Create the colormap: 
halfpurples = {'blue': [(0.0,1.0,1.0),(0.000001, 0.78431373834609985, 0.78431373834609985), 
(0.25, 0.729411780834198, 0.729411780834198), (0.5, 
0.63921570777893066, 0.63921570777893066), (0.75, 
0.56078433990478516, 0.56078433990478516), (1.0, 0.49019607901573181, 
0.49019607901573181)], 

    'green': [(0.0,1.0,1.0),(0.000001, 
    0.60392159223556519, 0.60392159223556519), (0.25, 
    0.49019607901573181, 0.49019607901573181), (0.5, 
    0.31764706969261169, 0.31764706969261169), (0.75, 
    0.15294118225574493, 0.15294118225574493), (1.0, 0.0, 0.0)], 

    'red': [(0.0,1.0,1.0),(0.000001, 
    0.61960786581039429, 0.61960786581039429), (0.25, 
    0.50196081399917603, 0.50196081399917603), (0.5, 
    0.41568627953529358, 0.41568627953529358), (0.75, 
    0.32941177487373352, 0.32941177487373352), (1.0, 
    0.24705882370471954, 0.24705882370471954)]} 

halfpurplecmap = mpl.colors.LinearSegmentedColormap('halfpurples',halfpurples,256) 

#Create x,y arrays of normally distributed points 
npts = 100000 
x = numpy.random.standard_normal(npts) 
y = numpy.random.standard_normal(npts) 

#Set bin numbers in both axes 
nxbins = 100 
nybins = 100 

#Set the cutoff for resolving the individual points 
minperbin = 1 

#Make the density histogram 
H, yedges, xedges = np.histogram2d(y,x,bins=(nybins,nxbins)) 
#Reorient the axes 
H = H[::-1] 

extent = [xedges[0],xedges[-1],yedges[0],yedges[-1]] 

#Figure out which bin each x,y point is in 
xbinsize = xedges[1]-xedges[0] 
ybinsize = yedges[1]-yedges[0] 
#Clip so that points exactly on the right and upper edges fall in the last bin 
xi = np.clip(((x-xedges[0])/xbinsize).astype(int),0,nxbins-1) 
yi = nybins-1-np.clip(((y-yedges[0])/ybinsize).astype(int),0,nybins-1) 

#Get all points with density below the threshold 
lowdensityx = x[H[yi,xi] <= minperbin] 
lowdensityy = y[H[yi,xi] <= minperbin] 

#Plot 
fig1 = plt.figure() 
ax1 = fig1.add_subplot(111) 
ax1.plot(lowdensityx,lowdensityy,linestyle='none',marker='o',mfc='k',mec='k',ms=3) 
cp1 = ax1.imshow(H,interpolation='nearest',extent=extent,cmap=halfpurplecmap,vmin=minperbin) 
fig1.colorbar(cp1) 

fig1.savefig('contourtest.eps') 
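
The bin-lookup trick can also be wrapped in a small helper. The sketch below is a variation, not the code above: it uses `np.searchsorted` on the interior bin edges instead of dividing by the bin size, and the function name `sparse_points` and its return layout are invented for this illustration. Here `H` is indexed as `H[xi, yi]`, so it would need transposing before being passed to `imshow`.

```python
import numpy as np

def sparse_points(x, y, bins=(100, 100), minperbin=1):
    """Return (H, xedges, yedges, mask); mask marks points in sparse bins."""
    H, xedges, yedges = np.histogram2d(x, y, bins=bins)
    # Counting the interior edges <= each coordinate gives its bin index
    # and places points on the outermost right/upper edge in the last bin
    xi = np.searchsorted(xedges[1:-1], x, side='right')
    yi = np.searchsorted(yedges[1:-1], y, side='right')
    mask = H[xi, yi] <= minperbin
    return H, xedges, yedges, mask

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = rng.standard_normal(100_000)
H, xedges, yedges, mask = sparse_points(x, y)
lowx, lowy = x[mask], y[mask]   # points to draw individually
```
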
+0

I gave you an upvote for implementing my suggestion :-) Always try to work with NumPy built-ins; they are faster than list comprehensions. – Oz123

4

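For the record, here is the result of a new attempt using scipy.stats.gaussian_kde rather than a 2D histogram. Depending on the purpose, different combinations of color mesh and contours can be envisioned.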

import numpy as np 
from matplotlib import pyplot as plt 
from scipy.stats import gaussian_kde 

# parameters 
npts = 5000   # number of sample points 
bins = 100   # number of bins in density maps 
threshold = 0.01 # density threshold for scatter plot 

# initialize figure 
fig, ax = plt.subplots() 

# create a random dataset 
x1, y1 = np.random.multivariate_normal([0, 0], [[1, 0], [0, 1]], npts//2).T 
x2, y2 = np.random.multivariate_normal([4, 4], [[4, 0], [0, 1]], npts//2).T 
x = np.hstack((x1, x2)) 
y = np.hstack((y1, y2)) 
points = np.vstack([x, y]) 

# perform kernel density estimate 
kde = gaussian_kde(points) 
z = kde(points) 

# mask points above density threshold 
x = np.ma.masked_where(z > threshold, x) 
y = np.ma.masked_where(z > threshold, y) 

# plot unmasked points 
ax.scatter(x, y, c='black', marker='.') 

# get bounds from axes 
xmin, xmax = ax.get_xlim() 
ymin, ymax = ax.get_ylim() 

# prepare grid for density map 
xedges = np.linspace(xmin, xmax, bins) 
yedges = np.linspace(ymin, ymax, bins) 
xx, yy = np.meshgrid(xedges, yedges) 
gridpoints = np.array([xx.ravel(), yy.ravel()]) 

# compute density map 
zz = np.reshape(kde(gridpoints), xx.shape) 

# plot density map 
im = ax.imshow(zz, cmap='CMRmap_r', interpolation='nearest', 
       origin='lower', extent=[xmin, xmax, ymin, ymax]) 

# plot threshold contour 
cs = ax.contour(xx, yy, zz, levels=[threshold], colors='black') 

# show 
fig.colorbar(im) 
plt.show() 

Smooth scatter plot