Identifying chunks of outliers in 1D and 2D data in Python

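Data: I have a quantity d in one column that varies as a function of two other variables, a and b, defined in two other columns. My goal is to identify chunks of outliers in d. These points may not be outliers in the strict sense, but in my case I want to pick out the data that do not follow the data cloud to which a linear fit can be applied.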

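Question: Although I have never done cluster analysis before, the name sounds like it could achieve what I want. If I go with cluster analysis, I would like to do it for the following two cases:

1. With a and d
2. With a, b and d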

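I did some searching and found that for #1 the KernelDensity module would be more appropriate, while for #2 the MeanShift module looks like a good choice. Both are available in Python.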

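Problem: Because I have never done cluster analysis before, I cannot follow the examples given in the documentation for KernelDensity and MeanShift. Can someone explain how to use KernelDensity and MeanShift to identify the "chunks" of outliers in d for case 1 and case 2?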


2015-07-09 Pupil

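I think you first need a robust regression, because your data are already contaminated by some outliers. Once the robust regression is fitted, the squared residual at each point can serve as a distance measure to the cluster center (the regression line). Observations with a large squared residual are likely to be outliers.

Reference link for robust regression in sklearn: http://scikit-learn.org/stable/modules/linear_model.html#robustness-regression-outliers-and-modeling-errors

@JanxunLi: I'm sorry, but I could not follow the example given in that reference. Could you give a simple example? – Pupil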

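First, KernelDensity is meant for nonparametric methods. Since you firmly believe the relationship is linear (that is, a parametric model), KernelDensity is not the most suitable choice for this task.

Below is sample code for identifying the outliers.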

import matplotlib.pyplot as plt 
import numpy as np 
from sklearn.linear_model import RANSACRegressor 


# data: 1000 obs, 100 of them are outliers 
# ===================================================== 
np.random.seed(0) 
a = np.random.randn(1000) 
b = np.random.randn(1000) 
d = 2 * a - b + np.random.randn(1000) 
# the last 100 are outliers 
d[-100:] = d[-100:] + 10 * np.abs(np.random.randn(100)) 

fig, axes = plt.subplots(ncols=2, sharey=True) 
axes[0].scatter(a, d, c='g') 
axes[0].set_xlabel('a') 
axes[0].set_ylabel('d') 
axes[1].scatter(b, d, c='g') 
axes[1].set_xlabel('b')


# processing 
# ===================================================== 
# robust regression 
robust_estimator = RANSACRegressor(random_state=0) 
robust_estimator.fit(np.vstack([a,b]).T, d) 
d_pred = robust_estimator.predict(np.vstack([a,b]).T) 

# squared error of each observation (stored as mse)
mse = (d - d_pred.ravel()) ** 2 

# flag the 50 largest squared errors; 50 is an arbitrary choice and does not assume we already know there are 100 outliers 
index = np.argsort(mse) 
fig, axes = plt.subplots(ncols=2, sharey=True) 
axes[0].scatter(a[index[:-50]], d[index[:-50]], c='b', label='inliers') 
axes[0].scatter(a[index[-50:]], d[index[-50:]], c='r', label='outliers') 
axes[0].set_xlabel('a') 
axes[0].set_ylabel('d') 
axes[0].legend(loc='best') 
axes[1].scatter(b[index[:-50]], d[index[:-50]], c='b', label='inliers') 
axes[1].scatter(b[index[-50:]], d[index[-50:]], c='r', label='outliers') 
axes[1].legend(loc='best') 
axes[1].set_xlabel('b')

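If picking a fixed number of points feels too arbitrary, one alternative is to read the inlier/outlier split straight off the fitted estimator: RANSACRegressor stores a boolean `inlier_mask_` after `fit`. The following is only a sketch on the same kind of synthetic data as above. The value `residual_threshold=3.0` is my own assumption (roughly three noise standard deviations here); when it is omitted, scikit-learn uses the median absolute deviation of the target values.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

# Synthetic data as in the example above: 1000 obs, the last 100 shifted upward
rng = np.random.RandomState(0)
a = rng.randn(1000)
b = rng.randn(1000)
d = 2 * a - b + rng.randn(1000)
d[-100:] += 10 * np.abs(rng.randn(100))

# One row per observation, one column per feature
X = np.column_stack([a, b])

# residual_threshold=3.0 is an illustrative assumption, not a recommended value
ransac = RANSACRegressor(residual_threshold=3.0, random_state=0)
ransac.fit(X, d)

inlier_mask = ransac.inlier_mask_   # True where RANSAC treated the point as an inlier
outlier_mask = ~inlier_mask

print("flagged outliers:", outlier_mask.sum())
print("flagged among the last 100:", outlier_mask[-100:].sum())
```

Points whose shift happens to be small stay inside the threshold, so not all of the last 100 will be flagged; that is expected rather than a bug.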

For your sample data:

import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 
from sklearn.linear_model import RANSACRegressor 

df = pd.read_excel('/home/Jian/Downloads/Data.xlsx').dropna() 

a = df.a.values.reshape(len(df), 1) 
d = df.d.values.reshape(len(df), 1) 

fig, axes = plt.subplots(ncols=2, sharey=True) 
axes[0].scatter(a, d, c='g') 
axes[0].set_xlabel('a') 
axes[0].set_ylabel('d') 

robust_estimator = RANSACRegressor(random_state=0) 
robust_estimator.fit(a, d) 
d_pred = robust_estimator.predict(a) 

# squared error of each observation (stored as mse)
mse = (d - d_pred) ** 2 

index = np.argsort(mse.ravel()) 

axes[1].scatter(a[index[:-50]], d[index[:-50]], c='b', label='inliers', alpha=0.2) 
axes[1].scatter(a[index[-50:]], d[index[-50:]], c='r', label='outliers') 
axes[1].set_xlabel('a') 
axes[1].legend(loc=2)
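The cut at 50 points is again arbitrary. A data-driven cutoff is possible, for example flagging residuals that lie more than a few robust standard deviations from the median residual. The sketch below uses made-up data as a stand-in for the a and d columns of the spreadsheet, and the multiplier `k = 3.5` is an assumption you would need to tune for your data.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

# Hypothetical stand-in for the a and d columns of the spreadsheet
rng = np.random.RandomState(1)
a = rng.uniform(0, 10, size=500)
d = 3.0 * a + rng.randn(500)
d[:40] += 15 + 5 * rng.rand(40)   # a chunk of points sitting above the main cloud

X = a.reshape(-1, 1)
model = RANSACRegressor(random_state=0).fit(X, d)
residuals = d - model.predict(X)

# Robust scale: median absolute deviation, rescaled by 1.4826 so that it
# matches the standard deviation when the noise is Gaussian
center = np.median(residuals)
mad = 1.4826 * np.median(np.abs(residuals - center))

k = 3.5  # cutoff in robust standard deviations; an assumption to tune
is_outlier = np.abs(residuals - center) > k * mad

print("flagged:", is_outlier.sum())
```

The resulting `is_outlier` mask can replace `index[-50:]` in the plotting calls, with `~is_outlier` in place of `index[:-50]`.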


2015-07-09 22:48:41

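@Pupil I have updated the code. Please take a look.

From what I can understand of the code, you are 1) starting from the premise that you already know which points are outliers, and 2) fitting the regression to all the data, outliers included. Also, what parameters are you using in the robust regression? My goal, however, is 1) to first identify these chunks of outliers, and 2) to do a linear regression only on the data cloud lying below the outliers. – Pupil

@Pupil No, I do not assume any knowledge about the outliers. Nothing in the second half of the code assumes that the outliers are the last 100 obs. The code above demonstrates how to remove the outliers. If you like, you can redo the linear regression with the remaining inliers. `.T` is just the transpose operator; it makes sure that each column is a feature.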
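To make that last step concrete, here is a rough sketch of refitting an ordinary linear regression on the inliers only. It uses the synthetic data from the answer rather than your spreadsheet, takes the inliers from the `inlier_mask_` attribute of the fitted RANSACRegressor, and leaves the RANSAC settings at their defaults, which may not suit real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

rng = np.random.RandomState(0)
a = rng.randn(1000)
b = rng.randn(1000)
d = 2 * a - b + rng.randn(1000)
d[-100:] += 10 * np.abs(rng.randn(100))

# np.vstack([a, b]) has shape (2, 1000); .T turns it into (1000, 2),
# i.e. one row per observation and one column per feature
X = np.vstack([a, b]).T

ransac = RANSACRegressor(random_state=0).fit(X, d)
keep = ransac.inlier_mask_          # True for points RANSAC kept as inliers

ols_all = LinearRegression().fit(X, d)                   # contaminated fit
ols_inliers = LinearRegression().fit(X[keep], d[keep])   # refit on inliers only

print("all data:    ", ols_all.coef_, ols_all.intercept_)
print("inliers only:", ols_inliers.coef_, ols_inliers.intercept_)
```

The data were generated with coefficients 2 and -1 and no intercept, so comparing the two printed lines shows how much the upward-shifted points pull the fit on the full data.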
