Finding outliers in data

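I am trying to find outliers in Seconds using the standard deviation. I have two dataframes, shown below. The outliers I am looking for are values more than 1.5 standard deviations away from the mean for that day of the week. My current code is below the dataframes.

DF1: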

name  dateTime             Seconds
joe   2015-02-04 12:12:12  54321.0202
john  2015-01-02 13:13:13  12345.0101
joe   2015-02-04 12:12:12  54321.0202
john  2015-01-02 13:13:13  12345.0101
joe   2015-02-04 12:12:12  54321.0202
john  2015-01-02 13:13:13  12345.0101
joe   2015-02-04 12:12:12  54321.0202
john  2015-01-02 13:13:13  12345.0101
joe   2015-02-04 12:12:12  54321.0202
john  2015-01-02 13:13:13  12345.0101
joe   2015-02-04 12:12:12  54321.0202
joe   2015-01-02 13:13:13  12345.0101

Current output: DF2

name  day  standardDev   mean          count
Joe   mon  22326.502700  40900.730647  1886
      tue  9687.486726   51166.213836  159
john  mon  10072.707891  41380.035108  883
      tue  5499.475345   26985.938776  196

Expected output:

DF2

name  day  standardDev   mean          count  events
Joe   mon  22326.502700  40900.730647  1886   [2015-02-04 12:12:12, 2015-02-04 12:12:13]
      tue  9687.486726   51166.213836  159    [2015-02-04 12:12:12, 2015-02-04 12:12:14]
john  mon  10072.707891  41380.035108  883    [2015-01-02 13:13:13, 2015-01-02 13:13:15]
      tue  5499.475345   26985.938776  196    [2015-01-02 13:13:13, 2015-01-02 13:13:18]

Code:

import glob
import numpy as np
import pandas as pd

# folderPath is the directory holding the CSV files (defined elsewhere).
columns = ['EventTime', 'IpAddress', 'Hostname', 'TargetUserName',
           'AuthenticationPackageName', 'TargetDomainName', 'EventReceivedTime']
allFiles = glob.glob(folderPath + "/*.csv")
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, names=columns)
    df = df.iloc[1:]  # skip the first row of each file (.ix has been removed from pandas)
    list_.append(df)
df = pd.concat(list_)
df['DateTime'] = pd.to_datetime(df['EventTime'])
df['day_of_week'] = df.DateTime.dt.strftime('%a')
# Seconds elapsed since midnight for each event.
df['seconds'] = pd.to_timedelta(df.DateTime.dt.time.astype(str)).dt.seconds
# Named aggregation replaces the nested-dict form, which newer pandas rejects.
print(df.groupby(['TargetUserName', 'day_of_week'])['seconds'].agg(
    mean='mean', std=lambda x: np.std(x), count='count'))

2017-01-07 johnnyb

Maybe 'df1[df1.groupby(pd.DatetimeIndex(df.dateTime).dayofweek)['Seconds'].apply(lambda x: x > (1.5 * x.std() + x.mean()))]'? – Abdou

What exactly do you mean by "I am not sure how to reach the expected output"? – Amjad

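I am trying to figure out how to add an events column and track every event that lies more than 1.5 standard deviations above or below the mean. Ideally, I would like any row that falls outside that range to be added, with its full data, to the events column as a list of events. – johnnyb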

This is slightly adapted from the pandas docs. I did not create columns for the mean and std, but you can easily add them if you want to see them.

import numpy as np
import pandas as pd

np.random.seed(1111)
df = pd.DataFrame({'name': ['joe', 'john'] * 30,
                   'dateTime': pd.date_range('1-1-2015', periods=60),
                   'Seconds': np.random.randn(60) + 5000.})

grp = df.groupby(['name',df.dateTime.dt.dayofweek])['Seconds'] 
df['zscore'] = grp.transform(lambda x: (x-x.mean())/x.std()) 

df[ df['zscore'].abs() > 1.5 ] 
Out[79]:
        Seconds   dateTime  name    zscore
1   4998.927011 2015-01-02  john -1.522488
42  5001.275866 2015-02-12   joe  1.636829
58  4999.124550 2015-02-28   joe -1.624945

df.head(10) 
Out[80]:
       Seconds   dateTime  name    zscore
0  4998.699990 2015-01-01   joe -0.959960
1  4998.927011 2015-01-02  john -1.522488
2  5000.790199 2015-01-03   joe  0.263690
3  4999.121735 2015-01-04  john -1.005137
4  5001.501822 2015-01-05   joe  1.132407
5  4999.976071 2015-01-06  john  0.678951
6  5000.275949 2015-01-07   joe  0.650297
7  4999.033607 2015-01-08  john -0.964222
8  4998.419685 2015-01-09   joe -1.328744
9  4999.796325 2015-01-10  john  1.224198
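
For the events column asked about in the question, one possible extension of the z-score approach is sketched below. It is only an illustration on the same synthetic data: the names day, summary, events and THRESHOLD are made up here, and the real data would use the asker's own column names.

```python
import numpy as np
import pandas as pd

# Same synthetic data as in the answer.
np.random.seed(1111)
df = pd.DataFrame({'name': ['joe', 'john'] * 30,
                   'dateTime': pd.date_range('1-1-2015', periods=60),
                   'Seconds': np.random.randn(60) + 5000.})

# Group by person and day of week (0 = Monday ... 6 = Sunday).
df['day'] = df.dateTime.dt.dayofweek
grp = df.groupby(['name', 'day'])['Seconds']
df['zscore'] = grp.transform(lambda x: (x - x.mean()) / x.std())

# Per-group summary: mean, sample standard deviation and count.
summary = grp.agg(['mean', 'std', 'count'])

# Collect the timestamps of rows more than THRESHOLD standard
# deviations away from their group mean (1.5 comes from the question).
THRESHOLD = 1.5
outliers = df[df['zscore'].abs() > THRESHOLD]
events = outliers.groupby(['name', 'day'])['dateTime'].apply(list)

# Align on the (name, day) index; groups without outliers get an empty list.
summary['events'] = events.reindex(summary.index)
summary['events'] = summary['events'].apply(
    lambda v: v if isinstance(v, list) else [])
print(summary)
```

Storing lists in a column is convenient for display, but for further analysis it is usually easier to keep the filtered outliers frame and join it back on name and day when needed.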

2017-01-08 02:43:02 JohnE

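Is this computing the z-score for each user separately for each day of the week? I am trying to check people against the 1.5 threshold on specific days of the week, based on their time patterns. – johnnyb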

Yes. You can check a specific person/day of week like this: 'df[(df.dateTime.dt.dayofweek == 1) & (df.name == 'joe')]', if that makes it clearer. – JohnE
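
To see that the z-score really is computed per person and per day of the week, one way is to pick a single group and recompute it by hand. This is a small sketch on the synthetic data from the answer; the names dow, subset and manual are illustrative only.

```python
import numpy as np
import pandas as pd

# Same synthetic data as in the answer.
np.random.seed(1111)
df = pd.DataFrame({'name': ['joe', 'john'] * 30,
                   'dateTime': pd.date_range('1-1-2015', periods=60),
                   'Seconds': np.random.randn(60) + 5000.})

dow = df.dateTime.dt.dayofweek  # 0 = Monday ... 6 = Sunday
df['zscore'] = (df.groupby(['name', dow])['Seconds']
                  .transform(lambda x: (x - x.mean()) / x.std()))

# One person on one day of the week (1 = Tuesday).
subset = df[(dow == 1) & (df.name == 'joe')]

# Recompute the z-score using only the rows of this subset.
manual = ((subset['Seconds'] - subset['Seconds'].mean())
          / subset['Seconds'].std())

# If the grouping works as described, both versions agree.
print(np.allclose(manual, subset['zscore']))
```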
