2014-02-27 73 views
4

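Distance matrix creation using an np array with pdist and squareform

I want to cluster location data with DBSCAN (the scikit-learn implementation). My data is in np array format, but to use DBSCAN with the Haversine formula I need to create a distance matrix. When I try to do this I get the error below (a 'module' object is not callable error). From what I have read online this is an import error, but I am fairly sure that is not my case. I have written my own haversine distance formula, but I am sure the error is not in it.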

This is my input data, an np array (ResultArray).

[[ 53.3252628 -6.2644198 ] 
[ 53.3287395 -6.2646543 ] 
[ 53.33321202 -6.24785807] 
[ 53.3261015 -6.2598324 ] 
[ 53.325291 -6.2644105 ] 
[ 53.3281323 -6.2661467 ] 
[ 53.3253074 -6.2644483 ] 
[ 53.3388147 -6.2338417 ] 
[ 53.3381102 -6.2343826 ] 
[ 53.3253074 -6.2644483 ] 
[ 53.3228188 -6.2625379 ] 
[ 53.3253074 -6.2644483 ]] 

This is the line of code that raises the error.

distance_matrix = sp.spatial.distance.squareform(sp.spatial.distance.pdist 
(ResultArray,(lambda u,v: haversine(u,v)))) 

This is the error message:

File "Location.py", line 48, in <module> 
distance_matrix = sp.spatial.distance.squareform(sp.spatial.distance.pdist 
(ResArray,(lambda u,v: haversine(u,v)))) 
File "/usr/lib/python2.7/dist-packages/scipy/spatial/distance.py", line 1118, in pdist 
dm[k] = dfun(X[i], X[j]) 
File "Location.py", line 48, in <lambda> 
distance_matrix = sp.spatial.distance.squareform(sp.spatial.distance.pdist 
(ResArray,(lambda u,v: haversine(u,v)))) 
TypeError: 'module' object is not callable 

I import SciPy as sp (import scipy as sp).

+0

Note that ELKI has index acceleration for DBSCAN, using R*-trees. That does not need O(n^2) time and memory. It also has OPTICS, which is like DBSCAN 2.0.

Answers

4

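Simply put, scipy's pdist does not allow passing in a custom distance function. As you can read in the docs, you have some options, but haversine distance is not in the list of supported metrics.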

(Matlab's pdist does support that option, though; see here.)

You need to do the calculation "manually", i.e. with loops. Something like this will work:

from numpy import array,zeros 

def haversine(lon1, lat1, lon2, lat2): 
    """ See the link below for a possible implementation """ 
    pass 

# example input (yours, truncated) 
ResultArray = array([[ 53.3252628, -6.2644198 ], 
        [ 53.3287395 , -6.2646543 ], 
        [ 53.33321202 , -6.24785807], 
        [ 53.3253074 , -6.2644483 ]]) 

N = ResultArray.shape[0] 
distance_matrix = zeros((N, N)) 
for i in xrange(N): 
    for j in xrange(N): 
     lati, loni = ResultArray[i] 
     latj, lonj = ResultArray[j] 
     distance_matrix[i, j] = haversine(loni, lati, lonj, latj) 
     distance_matrix[j, i] = distance_matrix[i, j] 

print distance_matrix 
[[ 0.   0.38666203 1.41010971 0.00530489] 
[ 0.38666203 0.   1.22043364 0.38163748] 
[ 1.41010971 1.22043364 0.   1.40848782] 
[ 0.00530489 0.38163748 1.40848782 0.  ]] 

FYI, a Python implementation of Haversine can be found here.
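A minimal sketch of the loop approach with the `haversine` stub filled in. The function body, the `EARTH_RADIUS_KM` constant, and the `points` name are assumptions for illustration (the linked implementation may differ, for example in the Earth radius it uses), and the code is written for Python 3, so `range` replaces `xrange`.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius in km


def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    # clamp to 1.0 to guard against floating-point overshoot
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))


# (lat, lon) pairs, same truncated input as above
points = [
    (53.3252628, -6.2644198),
    (53.3287395, -6.2646543),
    (53.33321202, -6.24785807),
    (53.3253074, -6.2644483),
]

n = len(points)
distance_matrix = [[0.0] * n for _ in range(n)]
for i in range(n):
    # only visit the upper triangle; the matrix is symmetric
    for j in range(i + 1, n):
        lat_i, lon_i = points[i]
        lat_j, lon_j = points[j]
        d = haversine(lon_i, lat_i, lon_j, lat_j)
        distance_matrix[i][j] = d
        distance_matrix[j][i] = d

for row in distance_matrix:
    print(row)
```

With a radius of 6371 km the first two points come out roughly 0.387 km apart; small differences from the matrix printed above are expected if another radius is used.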

5

As suggested by the SciPy documentation at this link, and quoted here for convenience, you can define a custom distance function:

Y = pdist(X, f) 
Computes the distance between all pairs of vectors in X using the user supplied 2-arity function f. For example, Euclidean distance between the vectors could be computed as follows: 

dm = pdist(X, lambda u, v: np.sqrt(((u-v)**2).sum())) 

Here is my version of the code, inspired by the code at this link:

from numpy import sin,cos,arctan2,sqrt,pi # import from numpy 
# earth's mean radius = 6,371km 
EARTHRADIUS = 6371.0 

def getDistanceByHaversine(loc1, loc2): 
    '''Haversine formula - give each location as a
    (lon_decimal, lat_decimal) pair in decimal degrees'''
    #  
    # "unpack" our numpy array, this extracts column wise arrays 
    lat1 = loc1[1] 
    lon1 = loc1[0] 
    lat2 = loc2[1] 
    lon2 = loc2[0] 
    # 
    # convert to radians ##### Completely identical 
    lon1 = lon1 * pi/180.0 
    lon2 = lon2 * pi/180.0 
    lat1 = lat1 * pi/180.0 
    lat2 = lat2 * pi/180.0 
    # 
    # haversine formula #### Same, but atan2 named arctan2 in numpy 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = (sin(dlat/2))**2 + cos(lat1) * cos(lat2) * (sin(dlon/2.0))**2 
    c = 2.0 * arctan2(sqrt(a), sqrt(1.0-a)) 
    km = EARTHRADIUS * c 
    return km 

and call it as follows:

D = spatial.distance.pdist(A, lambda u, v: getDistanceByHaversine(u,v)) 

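In my implementation, matrix A holds the longitude values in the first column and the latitude values in the second column, both in decimal degrees.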
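The traceback in the question shows pdist actually calling the lambda, so the callable route does work. The TypeError is what Python raises when the name being called is bound to a module rather than a function, for instance after `import haversine` instead of `from haversine import haversine`; that reading of the asker's imports is an assumption, since they are not shown. Below is a minimal sketch of the full pipeline for data stored as (lat, lon), as in the question. The names `haversine_km`, `coords`, and `EARTH_RADIUS_KM` are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius in km


def haversine_km(u, v):
    """Distance in km between two (lat, lon) rows given in decimal degrees."""
    lat1, lon1, lat2, lon2 = np.radians([u[0], u[1], v[0], v[1]])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# (lat, lon) rows, truncated from the question's ResultArray
coords = np.array([[53.3252628, -6.2644198],
                   [53.3287395, -6.2646543],
                   [53.33321202, -6.24785807],
                   [53.3253074, -6.2644483]])

# pdist calls the function once per unordered pair of rows and
# returns a condensed vector of length n*(n-1)/2
condensed = pdist(coords, haversine_km)

# squareform expands it into a symmetric (n, n) matrix with a zero diagonal
distance_matrix = squareform(condensed)
print(distance_matrix)
```

Passing the function object directly is equivalent to wrapping it in `lambda u, v: haversine_km(u, v)`.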

0

You can now use scikit-learn's DBSCAN with the haversine metric to cluster spatial latitude-longitude data, without precomputing a distance matrix with scipy.

db = DBSCAN(eps=2/6371., min_samples=5, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates)) 

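This comes from the tutorial "clustering spatial data with scikit-learn DBSCAN". Note in particular that the eps value is 2 km divided by 6371 (the Earth's radius in km) to convert it to radians. Also note that .fit() takes the coordinates in radians for the haversine metric.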
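As a rough sketch of that call on the question's own data: the 0.1 km neighbourhood radius and `min_samples=3` are values picked purely for illustration, not recommendations, and `EARTH_RADIUS_KM`, `eps_km`, and `coords` are hypothetical names. scikit-learn's haversine metric expects each row as (lat, lon) in radians, which matches the column order of the question's array once it is converted.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius in km

# (lat, lon) in decimal degrees, as in the question
coords = np.array([[53.3252628, -6.2644198],
                   [53.3287395, -6.2646543],
                   [53.33321202, -6.24785807],
                   [53.3261015, -6.2598324],
                   [53.325291, -6.2644105],
                   [53.3281323, -6.2661467],
                   [53.3253074, -6.2644483],
                   [53.3388147, -6.2338417],
                   [53.3381102, -6.2343826],
                   [53.3253074, -6.2644483],
                   [53.3228188, -6.2625379],
                   [53.3253074, -6.2644483]])

eps_km = 0.1  # illustrative neighbourhood radius in km

db = DBSCAN(
    eps=eps_km / EARTH_RADIUS_KM,  # km -> radians on the unit sphere
    min_samples=3,                 # illustrative; counts the point itself
    algorithm='ball_tree',
    metric='haversine',
).fit(np.radians(coords))          # degrees -> radians before fitting

labels = db.labels_  # one label per row; -1 marks noise
print(labels)
```

To read a distance returned in radians back as kilometres, multiply it by the same Earth radius.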