Whitening data with SciPy before k-means clustering

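I want to cluster some data using KMeans from sklearn.cluster, but I want to whiten the data first. I have a pandas DataFrame (a few hundred rows) with the following three columns: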

1) zipcode 
2) highclust 
3) callclust

I want to whiten the data with scipy.cluster.vq.whiten. From what I have read so far, the DataFrame columns have to be converted to a matrix before whitening, so I do:

features = df.as_matrix(columns = ['highclust', 'callclust'])

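Then I call whiten(features).

That works fine, but now I want to put these values back into the original df.

The problem is that I have no key to merge them on. If I include zipcode in features when I create it, the zip code gets whitened along with highclust and callclust, which makes it useless.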

Source

2015-12-19 dstar

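You _could_ just create a new DF and store the results there, rather than forcing them back into the old frame? –

I don't know how SciPy's whitening works, but maybe you could do something like `df['highclust'] = whitened[:, 0]`? Not a pretty solution, but it might meet your needs. – iled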

The simplest solution is to save the zip codes first, whiten everything, and then restore the zip codes.

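import pandas as pd
from scipy.cluster.vq import whiten

zips = df.zipcode  # keep the original zip codes
# Whiten every column; reuse the original index so zips realign correctly.
df = pd.DataFrame(whiten(df.values), columns=df.columns, index=df.index)
df['zipcode'] = zips  # overwrite the whitened zip codes with the saved ones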

You can also do the calculation yourself with a lambda function instead of using SciPy.

import numpy as np
import pandas as pd

np.random.seed(0)

# Only these columns are whitened; zipcode is left alone.
whiten_cols = ['highclust', 'callclust']
df = pd.DataFrame({'zipcode': [1, 2, 3, 4, 5],
                   'highclust': np.random.randn(5),
                   'callclust': np.random.randn(5)})[['zipcode'] + whiten_cols]

>>> df
   zipcode  highclust  callclust
0        1   1.764052  -0.977278
1        2   0.400157   0.950088
2        3   0.978738  -0.151357
3        4   2.240893  -0.103219
4        5   1.867558   0.410599

>>> df.std()
zipcode      1.581139
highclust    0.745445
callclust    0.717038
dtype: float64

# Whiten data. 
df.loc[:, whiten_cols] = df[whiten_cols].apply(lambda col: col/col.std()) 

>>> df
   zipcode  highclust  callclust
0        1   2.366442  -1.362937
1        2   0.536803   1.325018
2        3   1.312958  -0.211087
3        4   3.006115  -0.143952
4        5   2.505293   0.572631

>>> df.std()
zipcode      1.581139
highclust    1.000000
callclust    1.000000
dtype: float64

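pandas computes the standard deviation with N-1 in the denominator by default (ddof=1), whereas SciPy's whiten divides by N. This will not matter much for a large dataset, but you can match SciPy's results with ddof=0: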

df.loc[:, whiten_cols] = df[whiten_cols].apply(lambda col: col/col.std(ddof=0)) 

>>> df
   zipcode  highclust  callclust
0        1   2.645763  -1.523810
1        2   0.600164   1.481415
2        3   1.467932  -0.236002
3        4   3.360938  -0.160943
4        5   2.801003   0.640221

If you prefer to use SciPy directly:

# After resetting the seed and reinitializing the dataframe. 
df.loc[:, whiten_cols] = whiten(df[whiten_cols].values) 

>>> df
   zipcode  highclust  callclust
0        1   2.645763  -1.523810
1        2   0.600164   1.481415
2        3   1.467932  -0.236002
3        4   3.360938  -0.160943
4        5   2.801003   0.640221

>>> df.std()
zipcode      1.581139
highclust    1.118034
callclust    1.118034
dtype: float64
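
The 1.118034 in that last output is expected rather than a mistake: whiten divides by the population standard deviation (ddof=0), while df.std() reports the sample standard deviation (ddof=1), so the two differ by a factor of sqrt(N / (N - 1)). A minimal check with plain NumPy, where the variable names are only illustrative:

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(5)

# Divide by the population standard deviation, as whiten does per column.
whitened = x / x.std(ddof=0)

n = len(x)
sample_std = whitened.std(ddof=1)   # what pandas' df.std() reports by default
expected = np.sqrt(n / (n - 1))     # sqrt(5 / 4) for five rows

print(sample_std, expected)
```

For N = 5 both values come out at about 1.118034, and the gap shrinks toward 1 as N grows.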

scipy.cluster.vq.whiten

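scipy.cluster.vq.whiten(obs, check_finite=True): Normalize a group of observations on a per feature basis.

Before running k-means, it is beneficial to rescale each feature dimension of the observation set with whitening. Each feature is divided by its standard deviation across all observations to give it unit variance.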
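
To tie this back to the question, here is a rough end-to-end sketch: whiten only the feature columns, cluster them, and attach the labels to the original frame so zipcode is never touched. The data values are made up, and scipy.cluster.vq.kmeans2 is used in place of sklearn's KMeans purely to keep the example inside SciPy; the same whitened array could be passed to sklearn instead.

```python
import pandas as pd
from scipy.cluster.vq import kmeans2, whiten

# Hypothetical data: two clearly separated groups of rows.
df = pd.DataFrame({
    'zipcode': [10001, 10002, 10003, 94101, 94102, 94103],
    'highclust': [1.0, 1.2, 0.9, 8.0, 8.3, 7.9],
    'callclust': [100.0, 110.0, 95.0, 900.0, 920.0, 880.0],
})
feature_cols = ['highclust', 'callclust']

# Whiten only the feature columns; zipcode stays out of the array entirely.
features = whiten(df[feature_cols].to_numpy())

# Cluster the whitened features into two groups (k-means++ initialisation).
centroids, labels = kmeans2(features, 2, minit='++')

# Row order is unchanged, so the labels line up with the original rows.
df['cluster'] = labels
print(df)
```

Because the whitened array keeps the row order of df, no merge key is needed to bring the results back.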

Here is the source code for whiten:

obs = _asarray_validated(obs, check_finite=check_finite) 
std_dev = std(obs, axis=0) 
zero_std_mask = std_dev == 0 
if zero_std_mask.any(): 
    std_dev[zero_std_mask] = 1.0 
    warnings.warn("Some columns have standard deviation zero. " 
        "The values of these columns will not change.", 
        RuntimeWarning) 
return obs/std_dev
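
The zero-variance branch in that code can be seen with a small, made-up array whose second column is constant. This is only an illustration of the behaviour described by the warning message:

```python
import warnings

import numpy as np
from scipy.cluster.vq import whiten

# Hypothetical observations: the second column never changes.
obs = np.array([[1.0, 7.0],
                [2.0, 7.0],
                [3.0, 7.0]])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = whiten(obs)

# The first column is scaled to unit (population) standard deviation;
# the constant column is divided by 1.0 and therefore left unchanged.
print(result)
print([str(w.message) for w in caught])
```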

Source

2015-12-19 17:47:00 Alexander
