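Finding outliers in the data
I am trying to find outliers in the Seconds column using the standard deviation. I have two DataFrames, shown below. The outliers I am looking for are the rows that lie more than 1.5 standard deviations from the mean for that day of the week. My current code is below the DataFrames.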
DF1:
name  dateTime             Seconds
joe   2015-02-04 12:12:12  54321.0202
john  2015-01-02 13:13:13  12345.0101
joe   2015-02-04 12:12:12  54321.0202
john  2015-01-02 13:13:13  12345.0101
joe   2015-02-04 12:12:12  54321.0202
john  2015-01-02 13:13:13  12345.0101
joe   2015-02-04 12:12:12  54321.0202
john  2015-01-02 13:13:13  12345.0101
joe   2015-02-04 12:12:12  54321.0202
john  2015-01-02 13:13:13  12345.0101
joe   2015-02-04 12:12:12  54321.0202
joe   2015-01-02 13:13:13  12345.0101
Current output: DF2
name  day  standardDev   mean          count
Joe   mon  22326.502700  40900.730647  1886
      tue   9687.486726  51166.213836   159
john  mon  10072.707891  41380.035108   883
      tue   5499.475345  26985.938776   196
Expected output:
DF2
name  day  standardDev   mean          count  events
Joe   mon  22326.502700  40900.730647  1886   [2015-02-04 12:12:12, 2015-02-04 12:12:13]
      tue   9687.486726  51166.213836   159   [2015-02-04 12:12:12, 2015-02-04 12:12:14]
john  mon  10072.707891  41380.035108   883   [2015-01-02 13:13:13, 2015-01-02 13:13:15]
      tue   5499.475345  26985.938776   196   [2015-01-02 13:13:13, 2015-01-02 13:13:18]
CODE:
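import glob

import numpy as np
import pandas as pd

# folderPath is assumed to be defined earlier as the directory holding the CSV files
allFiles = glob.glob(folderPath + "/*.csv")
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, names=['EventTime', "IpAddress", "Hostname", "TargetUserName", "AuthenticationPackageName", "TargetDomainName", "EventReceivedTime"])
    df = df.iloc[1:]  # skip the first row of each file (.ix was removed in pandas 1.0)
    list_.append(df)
df = pd.concat(list_)
df['DateTime'] = pd.to_datetime(df['EventTime'])
df['day_of_week'] = df.DateTime.dt.strftime('%a')
# Seconds elapsed since midnight for each event
df['seconds'] = pd.to_timedelta(df.DateTime.dt.time.astype(str)).dt.seconds
# Per-user, per-weekday statistics; np.std(x) is the population standard deviation (ddof=0)
stats = df.groupby(['TargetUserName', 'day_of_week'])['seconds'].agg(mean='mean', std=lambda x: np.std(x), count='count')
print(stats)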
Maybe `df1[df1.groupby(pd.DatetimeIndex(df.dateTime).dayofweek)['Seconds'].apply(lambda x: x > (1.5 * x.std() + x.mean()))]`? – Abdou
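A possible variation on the suggestion above, as a minimal sketch rather than a tested answer: `groupby(...).transform` returns values aligned with the original rows, so the result can be used directly as a boolean mask, and taking the absolute deviation catches outliers on both sides of the mean. The sample frame and the helper name `outlier_mask` are made up for illustration; the column names follow DF1, and `std` here is pandas' default sample standard deviation (ddof=1).

```python
import pandas as pd

# Hypothetical sample shaped like DF1; all six timestamps fall on a Monday.
df1 = pd.DataFrame({
    "name": ["joe"] * 6,
    "dateTime": pd.to_datetime([
        "2015-02-02 12:00:00", "2015-02-09 12:00:10", "2015-02-16 12:00:20",
        "2015-02-23 12:00:30", "2015-03-02 12:00:40", "2015-03-09 23:00:00",
    ]),
})
# Seconds elapsed since midnight for each event.
df1["Seconds"] = (df1["dateTime"] - df1["dateTime"].dt.normalize()).dt.total_seconds()


def outlier_mask(df, k=1.5):
    """True for rows more than k standard deviations from their weekday's mean."""
    grouped = df.groupby(df["dateTime"].dt.dayofweek)["Seconds"]
    mean = grouped.transform("mean")  # weekday mean, repeated on every row
    std = grouped.transform("std")    # weekday sample std (ddof=1), same shape
    return (df["Seconds"] - mean).abs() > k * std


outliers = df1[outlier_mask(df1)]
print(outliers)
```

In this sample only the 23:00:00 event lies far enough from the Monday mean to be selected.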
What exactly do you mean by "I'm not sure how to reach the expected output"? – Amjad
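I am trying to figure out how to add the events column and track every event that lies more than 1.5 standard deviations above or below the mean. Ideally, I would like any row that falls outside that range to be added, with its full data, to the events column as a list of events. – johnnyb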
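One way the events column described here might be built, sketched under several assumptions: the input has `name` and `dateTime` columns as in DF1, "Seconds" means seconds since midnight, and the standard deviation is pandas' default sample version (ddof=1), which differs slightly from the `np.std` used in the question. The function name `add_events` and the sample data are invented for illustration. This version stores only the timestamps; keeping full rows would need a different aggregation.

```python
import pandas as pd


def add_events(df, k=1.5):
    """Per-(name, day) stats plus a list of timestamps lying beyond k std devs."""
    df = df.copy()
    df["day"] = df["dateTime"].dt.strftime("%a").str.lower()
    midnight = df["dateTime"].dt.normalize()
    df["Seconds"] = (df["dateTime"] - midnight).dt.total_seconds()

    grouped = df.groupby(["name", "day"])["Seconds"]
    stats = grouped.agg(standardDev="std", mean="mean", count="count")

    # Broadcast each group's mean/std back onto its rows, then flag outliers.
    deviation = (df["Seconds"] - grouped.transform("mean")).abs()
    is_outlier = deviation > k * grouped.transform("std")

    # Collect the flagged timestamps into one list per (name, day) group.
    events = (
        df[is_outlier]
        .groupby(["name", "day"])["dateTime"]
        .apply(list)
        .rename("events")
    )
    out = stats.join(events)
    # Groups without any outlier get an empty list instead of NaN.
    out["events"] = out["events"].apply(lambda v: v if isinstance(v, list) else [])
    return out


# Hypothetical sample: joe on Mondays (one late event), john on Tuesdays.
sample = pd.DataFrame({
    "name": ["joe"] * 6 + ["john"] * 3,
    "dateTime": pd.to_datetime([
        "2015-02-02 12:00:00", "2015-02-09 12:00:10", "2015-02-16 12:00:20",
        "2015-02-23 12:00:30", "2015-03-02 12:00:40", "2015-03-09 23:00:00",
        "2015-01-06 13:13:13", "2015-01-13 13:13:15", "2015-01-20 13:13:18",
    ]),
})
result = add_events(sample)
print(result)
```

Joining on the `(name, day)` index keeps the statistics table intact and simply attaches the list, so groups with no outlying events still appear in the result.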