2015-08-31

I have the following data. How can I use Pearson correlation as the distance metric with scikit-learn's agglomerative clustering?

State Murder Assault UrbanPop Rape 
Alabama 13.200 236 58 21.200 
Alaska 10.000 263 48 44.500 
Arizona 8.100 294 80 31.000 
Arkansas 8.800 190 50 19.500 
California 9.000 276 91 40.600 
Colorado 7.900 204 78 38.700 
Connecticut 3.300 110 77 11.100 
Delaware 5.900 238 72 15.800 
Florida 15.400 335 80 31.900 
Georgia 17.400 211 60 25.800 
Hawaii 5.300 46 83 20.200 
Idaho 2.600 120 54 14.200 
Illinois 10.400 249 83 24.000 
Indiana 7.200 113 65 21.000 
Iowa 2.200 56 57 11.300 
Kansas 6.000 115 66 18.000 
Kentucky 9.700 109 52 16.300 
Louisiana 15.400 249 66 22.200 
Maine 2.100 83 51 7.800 
Maryland 11.300 300 67 27.800 
Massachusetts 4.400 149 85 16.300 
Michigan 12.100 255 74 35.100 
Minnesota 2.700 72 66 14.900 
Mississippi 16.100 259 44 17.100 
Missouri 9.000 178 70 28.200 
Montana 6.000 109 53 16.400 
Nebraska 4.300 102 62 16.500 
Nevada 12.200 252 81 46.000 
New Hampshire 2.100 57 56 9.500 
New Jersey 7.400 159 89 18.800 
New Mexico 11.400 285 70 32.100 
New York 11.100 254 86 26.100 
North Carolina 13.000 337 45 16.100 
North Dakota 0.800 45 44 7.300 
Ohio 7.300 120 75 21.400 
Oklahoma 6.600 151 68 20.000 
Oregon 4.900 159 67 29.300 
Pennsylvania 6.300 106 72 14.900 
Rhode Island 3.400 174 87 8.300 
South Carolina 14.400 279 48 22.500 
South Dakota 3.800 86 45 12.800 
Tennessee 13.200 188 59 26.900 
Texas 12.700 201 80 25.500 
Utah 3.200 120 80 22.900 
Vermont 2.200 48 32 11.200 
Virginia 8.500 156 63 20.700 
Washington 4.000 145 73 26.200 
West Virginia 5.700 81 39 9.300 
Wisconsin 2.600 53 66 10.800 
Wyoming 6.800 161 60 15.600 

which I use to perform hierarchical clustering of the states. Here is complete working code:

import pandas as pd 
from sklearn.cluster import AgglomerativeClustering 

df = pd.read_table("http://dpaste.com/031VZPM.txt") 
samples = df["State"].tolist() 
ndf = df[["Murder", "Assault", "UrbanPop", "Rape"]] 
X = ndf.to_numpy()  # as_matrix() was removed in pandas 1.0 

cluster = AgglomerativeClustering(n_clusters=3, linkage='complete', 
                                  affinity='euclidean').fit(X) 
label = cluster.labels_ 
outclust = list(zip(label, samples)) 
outclust_df = pd.DataFrame(outclust, columns=["Clusters", "Samples"]) 

for clust in outclust_df.groupby("Clusters"): 
    print(clust) 

Note that in this approach I used euclidean distance. What I'd like to use instead is 1 - Pearson correlation as the distance. In R it looks like this:

dat <- read.table("http://dpaste.com/031VZPM.txt",sep="\t",header=TRUE) 
dist2 = function(x) as.dist(1-cor(t(x), method="pearson")) 
dat = dat[c("Murder","Assault","UrbanPop","Rape")] 
hclust(dist2(dat), method="ward.D") 

How can I achieve this with scikit-learn's AgglomerativeClustering? I know there is a 'precomputed' option for the affinity argument, but I don't know how to use it to solve my problem.
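For reference, one way to use the 'precomputed' option is to hand fit() the distance matrix itself rather than the raw features. A minimal sketch with a small stand-in array (note that recent scikit-learn releases renamed the affinity parameter to metric, which the try/except below accounts for):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Small stand-in for X (rows = states, columns = Murder, Assault, UrbanPop, Rape).
X = np.array([[13.2, 236, 58, 21.2],
              [10.0, 263, 48, 44.5],
              [ 2.1,  57, 56,  9.5],
              [ 0.8,  45, 44,  7.3]])

# 1 - Pearson correlation between rows; np.corrcoef returns the full
# sample-by-sample correlation matrix in a single call.
D = 1 - np.corrcoef(X)

# With a precomputed distance matrix, fit() receives D instead of X,
# and 'ward' linkage is not allowed, so use 'complete' or 'average'.
try:  # scikit-learn >= 1.2 spells the parameter `metric`
    model = AgglomerativeClustering(n_clusters=2, metric='precomputed',
                                    linkage='complete')
except TypeError:  # older releases spell it `affinity`
    model = AgglomerativeClustering(n_clusters=2, affinity='precomputed',
                                    linkage='complete')
labels = model.fit(D).labels_
print(labels)
```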


Are you trying to correlate population and crime? – achabacha322

Answer


You can define a custom affinity as a function that takes in your data and returns the affinity matrix:

from scipy.stats import pearsonr 
import numpy as np 

def pearson_affinity(M): 
    # Pairwise 1 - Pearson r between all rows of M (a 2-D np.array). 
    return 1 - np.array([[pearsonr(a, b)[0] for a in M] for b in M]) 

Then you can call the agglomerative clustering with this as the affinity function (you have to change the linkage, since 'ward' only works for euclidean distance):

cluster = AgglomerativeClustering(n_clusters=3, linkage='average', 
          affinity=pearson_affinity) 
cluster.fit(X) 

Note that for some reason it doesn't seem to work very well for your data:

cluster.labels_ 
Out[107]: 
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 
     0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
     0, 0, 1, 0]) 
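As a side note, the nested list comprehension above can be replaced by a single np.corrcoef call, and if you want something closer to the R hclust snippet, scipy's own hierarchy module consumes a condensed 1 - Pearson distance vector directly. A sketch with a small stand-in array (scipy's 'correlation' metric is exactly 1 - Pearson r; R's ward.D has no exact scipy counterpart for non-euclidean distances, so 'average' linkage is used here instead):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Stand-in for the ndf values (rows = states, columns = the four features).
X = np.array([[13.2, 236, 58, 21.2],
              [10.0, 263, 48, 44.5],
              [ 3.3, 110, 77, 11.1],
              [ 2.1,  57, 56,  9.5],
              [15.4, 335, 80, 31.9]])

# pdist's 'correlation' metric is 1 - Pearson r in condensed form --
# the same distances dist2() produces in the R snippet.
D = pdist(X, metric='correlation')

# Build the hierarchy from the precomputed distances, then cut it
# into at most 3 flat clusters, analogous to cutree in R.
Z = linkage(D, method='average')
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)
```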

The code doesn't work for me; the pearson_affinity(X) check fails. By the way, why df.values inside the function? Is it a np matrix or a pandas dataframe? – pdubois


OK, I've fixed the code now and changed it to one minus, and the clusters are a little better ;) M is an np.array – maxymoo