2017-01-07 37 views

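Finding outliers in the data

I am trying to find outliers in a seconds column using the standard deviation. I have two dataframes, shown below. The outliers I am looking for are the rows that lie more than 1.5 standard deviations from the mean for each day of the week. My current code is below the dataframes.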

DF1:

name dateTime    Seconds 
joe  2015-02-04 12:12:12 54321.0202 
john 2015-01-02 13:13:13 12345.0101 
joe  2015-02-04 12:12:12 54321.0202 
john 2015-01-02 13:13:13 12345.0101 
joe  2015-02-04 12:12:12 54321.0202 
john 2015-01-02 13:13:13 12345.0101 
joe  2015-02-04 12:12:12 54321.0202 
john 2015-01-02 13:13:13 12345.0101 
joe  2015-02-04 12:12:12 54321.0202 
john 2015-01-02 13:13:13 12345.0101 
joe  2015-02-04 12:12:12 54321.0202 
joe  2015-01-02 13:13:13 12345.0101 

Current output: DF2

name day standardDev  mean   count 
Joe mon 22326.502700  40900.730647 1886 
     tue 9687.486726  51166.213836 159 
john mon 10072.707891  41380.035108 883 
     tue 5499.475345  26985.938776 196 

Expected output:

DF2

name day standardDev  mean   count  events 
Joe mon 22326.502700  40900.730647 1886  [2015-02-04 12:12:12, 2015-02-04 12:12:13] 
     tue 9687.486726  51166.213836 159  [2015-02-04 12:12:12, 2015-02-04 12:12:14] 
john mon 10072.707891  41380.035108 883  [2015-01-02 13:13:13, 2015-01-02 13:13:15] 
     tue 5499.475345  26985.938776 196  [2015-01-02 13:13:13, 2015-01-02 13:13:18] 

CODE:

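import glob
import numpy as np
import pandas as pd

# folderPath is the directory holding the CSV files (defined elsewhere)
allFiles = glob.glob(folderPath + "/*.csv")
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, names=['EventTime', "IpAddress", "Hostname", "TargetUserName", "AuthenticationPackageName", "TargetDomainName", "EventReceivedTime"])
    df = df.iloc[1:]  # drop the first row (.ix has been removed from pandas)
    list_.append(df)
df = pd.concat(list_)
df['DateTime'] = pd.to_datetime(df['EventTime'])
df['day_of_week'] = df.DateTime.dt.strftime('%a')
# seconds elapsed since midnight
df['seconds'] = pd.to_timedelta(df.DateTime.dt.time.astype(str)).dt.seconds
# named aggregation replaces the nested-dict form, which newer pandas rejects;
# np.std defaults to the population standard deviation (ddof=0)
print(df.groupby(['TargetUserName', 'day_of_week'])['seconds'].agg(mean='mean', std=lambda x: np.std(x), count='count'))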

Maybe `df1[df1.groupby(pd.DatetimeIndex(df.dateTime).dayofweek)['Seconds'].apply(lambda x: x > (1.5 * x.std() + x.mean()))]`? – Abdou

What exactly do you mean by "I am not sure how to achieve the expected output"? – Amjad

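I am trying to figure out how to add the events column and track every event that falls more than 1.5 standard deviations above or below the mean. Ideally, any row that falls outside that range would be added, with its full data, to the events column as a list of events. – johnnyb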

Answer

This is slightly adapted from the pandas docs. I didn't create columns for the mean and std, but you could easily add them if you want to see them.

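import numpy as np
import pandas as pd

# synthetic data: 60 consecutive days, alternating between two names
np.random.seed(1111)
df=pd.DataFrame({ 'name':  ['joe','john']*30,
        'dateTime': pd.date_range('1-1-2015',periods=60),
        'Seconds': np.random.randn(60)+5000. })

# z-score of each row within its (name, day-of-week) group
grp = df.groupby(['name',df.dateTime.dt.dayofweek])['Seconds']
df['zscore'] = grp.transform(lambda x: (x-x.mean())/x.std())

# rows more than 1.5 standard deviations from their group mean
df[ df['zscore'].abs() > 1.5 ]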
Out[79]: 
     Seconds dateTime name zscore 
1 4998.927011 2015-01-02 john -1.522488 
42 5001.275866 2015-02-12 joe 1.636829 
58 4999.124550 2015-02-28 joe -1.624945 

df.head(10) 
Out[80]: 
     Seconds dateTime name zscore 
0 4998.699990 2015-01-01 joe -0.959960 
1 4998.927011 2015-01-02 john -1.522488 
2 5000.790199 2015-01-03 joe 0.263690 
3 4999.121735 2015-01-04 john -1.005137 
4 5001.501822 2015-01-05 joe 1.132407 
5 4999.976071 2015-01-06 john 0.678951 
6 5000.275949 2015-01-07 joe 0.650297 
7 4999.033607 2015-01-08 john -0.964222 
8 4998.419685 2015-01-09 joe -1.328744 
9 4999.796325 2015-01-10 john 1.224198 
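
The answer stops at flagging rows, while the question also asks for a per-group table with an events column. One possible way to get there is sketched below. It reuses the answer's synthetic data; the `day` column, the `keys` list, and the `df2` and `outliers` variables are names assumed here for illustration, and the `standardDev`, `mean`, `count`, and `events` column names are taken from the expected output in the question. The named-aggregation call assumes pandas 0.25 or later.

```python
import numpy as np
import pandas as pd

# Same synthetic data as in the answer.
np.random.seed(1111)
df = pd.DataFrame({
    'name': ['joe', 'john'] * 30,
    'dateTime': pd.date_range('1-1-2015', periods=60),
    'Seconds': np.random.randn(60) + 5000.0,
})

# Group key: person plus weekday name (assumed column name 'day').
df['day'] = df['dateTime'].dt.day_name()
keys = ['name', 'day']

# Per-row group mean and std, then the z-score within each group.
df['mean'] = df.groupby(keys)['Seconds'].transform('mean')
df['standardDev'] = df.groupby(keys)['Seconds'].transform('std')
df['zscore'] = (df['Seconds'] - df['mean']) / df['standardDev']

# One summary row per (name, day), as in DF2.
summary = df.groupby(keys)['Seconds'].agg(
    standardDev='std', mean='mean', count='count')

# Timestamps of rows more than 1.5 standard deviations from the group mean,
# collected into one list per group.
outliers = df[df['zscore'].abs() > 1.5]
events = outliers.groupby(keys)['dateTime'].apply(list).rename('events')

# Groups without outliers get an empty list instead of NaN.
df2 = summary.join(events)
df2['events'] = df2['events'].apply(lambda v: v if isinstance(v, list) else [])
print(df2)
```

Storing lists in a column keeps the DF2 layout from the question, but it makes further filtering awkward; keeping `outliers` as its own dataframe may be easier to work with if the full rows are needed.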
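Does this compute the zscore for each user, for each day of the week for that user? I am trying to find the people who are within 1.5 on specific days of the week, based on their time patterns. – johnnyb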

Yes. You can check a specific person and day of the week like this: `df[(df.dateTime.dt.dayofweek == 1) & (df.name == 'joe')]` and see if that makes it clearer. – JohnE
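
As a rough check of the point made in the last comment, the z-scores for one person and weekday can be recomputed by hand from that subset alone and compared with the grouped result. This is only a sketch on the answer's synthetic data; `subset` and `manual` are names assumed here. Note that pandas counts Monday as 0, so `dayofweek == 1` selects Tuesdays.

```python
import numpy as np
import pandas as pd

# Same synthetic data and z-score computation as in the answer.
np.random.seed(1111)
df = pd.DataFrame({
    'name': ['joe', 'john'] * 30,
    'dateTime': pd.date_range('1-1-2015', periods=60),
    'Seconds': np.random.randn(60) + 5000.0,
})
grp = df.groupby(['name', df.dateTime.dt.dayofweek])['Seconds']
df['zscore'] = grp.transform(lambda x: (x - x.mean()) / x.std())

# All of joe's rows that fall on a Tuesday (dayofweek == 1).
subset = df[(df.dateTime.dt.dayofweek == 1) & (df.name == 'joe')]

# Recompute the z-score using only this subset's mean and std.
manual = (subset['Seconds'] - subset['Seconds'].mean()) / subset['Seconds'].std()

# If the grouping works per person and per weekday, the two should agree.
print(np.allclose(manual, subset['zscore']))
```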