2017-08-14 29 views
2

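T-SQL: Count number of failures until first success (2)

I have a database containing task attempt events and their outcomes (fail or success). For each user, I want to count the number of failures before the first success. Subsequent failures and successes should not affect the output - I am only interested in the first success for a given task. In addition, the database contains rows for other events, which should be ignored.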

How can I formulate this in T-SQL for a Vertica database?

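(Eventually I would like to calculate the average number of attempts per task, but let's keep that out of the scope of this question to keep things manageable.)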

This is an update to the question here: T-SQL: Count number of failures until first success

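In the original question, the poorly constructed example data I provided did not fully reflect my use case, so the answers did not work on my real data and I could not verify them.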

The solution should not rely on row order - rows may be inserted out of timestamp order.

Here is the DB setup:

CREATE TABLE events ( 
     eventID int -- unused in this example, should be excluded from output 
    , eventName varchar(256) 
    , userName varchar(256) 
    , timestamp timestamp 
    , taskName varchar(256) 
    , sessionID int -- unused in this example, should be excluded from output 
); 

INSERT INTO events 
    VALUES 
     (2363460186192576512, 'beginSession', 'John', '2017-08-14 09:46:46.712', NULL, 145031357) 
     , (2363460852537008128, 'success', 'John', '2017-08-14 09:49:32.471', 'TaskOne', 145031357) 
     , (2363461162974437376, 'success', 'John', '2017-08-14 09:50:48.781', 'TaskOne', 145031357) 
     , (2363460390131740672, 'fail', 'John', '2017-08-14 09:47:37.349', 'TaskOne', 145031357) 
     , (2363460556662710272, 'fail', 'John', '2017-08-14 09:48:23.024', 'TaskOne', 145031357) 
     , (2363460730671505408, 'fail', 'John', '2017-08-14 09:48:58.646', 'TaskOne', 145031357) 
     , (2363461032111800320, 'fail', 'John', '2017-08-14 09:50:10.726', 'TaskOne', 145031357) 
     , (2363460389896859648, 'beginTask', 'John', '2017-08-14 09:47:05.32', 'TaskOne', 145031357) 
     , (2363460463137751040, 'beginTask', 'John', '2017-08-14 09:47:52.166', 'TaskOne', 145031357) 
     , (2363460556205531136, 'beginTask', 'John', '2017-08-14 09:48:12.615', 'TaskOne', 145031357) 
     , (2363460692671205376, 'beginTask', 'John', '2017-08-14 09:48:36.155', 'TaskOne', 145031357) 
     , (2363460852268572672, 'beginTask', 'John', '2017-08-14 09:49:12.047', 'TaskOne', 145031357) 
     , (2363460962524327936, 'beginTask', 'John', '2017-08-14 09:49:47.951', 'TaskOne', 145031357) 
     , (2363461162714390528, 'beginTask', 'John', '2017-08-14 09:50:23.645', 'TaskOne', 145031357) 
     , (2363474741421064192, 'beginSession', 'John', '2017-08-14 10:44:36.042', NULL, 145031392) 
     , (2363474885491200000, 'success', 'John', '2017-08-14 10:45:14.577', 'TaskTwo', 145031392) 
     , (2363475342389641216, 'success', 'John', '2017-08-14 10:47:04.098', 'TaskTwo', 145031392) 
     , (2363475473998635008, 'success', 'John', '2017-08-14 10:47:34.135', 'TaskOne', 145031392) 
     , (2363475822079254528, 'success', 'John', '2017-08-14 10:48:53.381', 'TaskTwo', 145031392) 
     , (2363476096949104640, 'success', 'John', '2017-08-14 10:50:07.441', 'TaskThree', 145031392) 
     , (2363475066098266112, 'fail', 'John', '2017-08-14 10:45:53.526', 'TaskTwo', 145031392) 
     , (2363475195152531456, 'fail', 'John', '2017-08-14 10:46:32.81', 'TaskTwo', 145031392) 
     , (2363475654638821376, 'fail', 'John', '2017-08-14 10:48:13.71', 'TaskThree', 145031392) 
     , (2363476247751114752, 'beginSession', 'Mike', '2017-08-14 10:50:37.67', NULL, 145030476) 
     , (2363476335819063296, 'success', 'Mike', '2017-08-14 10:51:06.841', 'TaskOne', 145030476) 
     , (2363476485643796480, 'success', 'Mike', '2017-08-14 10:51:41.086', 'TaskTwo', 145030476) 
     , (2363476806063038464, 'success', 'Mike', '2017-08-14 10:52:53.174', 'TaskTwo', 145030476) 
     , (2363477266119335936, 'success', 'Mike', '2017-08-14 10:54:32.053', 'TaskThree', 145030476) 
     , (2363477619191631872, 'success', 'Mike', '2017-08-14 10:56:01.783', 'TaskThree', 145030476) 
     , (2363476705131655168, 'fail', 'Mike', '2017-08-14 10:52:21.312', 'TaskThree', 145030476) 
     , (2363476939634896896, 'fail', 'Mike', '2017-08-14 10:53:28.906', 'TaskThree', 145030476) 
     , (2363477390937976832, 'fail', 'Mike', '2017-08-14 10:55:05.499', 'TaskThree', 145030476) 
     , (2363476335592570880, 'beginTask', 'Mike', '2017-08-14 10:50:50.074', 'TaskOne', 145030476) 
     , (2363476485501190144, 'beginTask', 'Mike', '2017-08-14 10:51:20.784', 'TaskTwo', 145030476) 
     , (2363476704779333632, 'beginTask', 'Mike', '2017-08-14 10:51:54.829', 'TaskThree', 145030476) 
     , (2363476805752659968, 'beginTask', 'Mike', '2017-08-14 10:52:34.001', 'TaskTwo', 145030476) 
     , (2363476939496484864, 'beginTask', 'Mike', '2017-08-14 10:53:06.468', 'TaskThree', 145030476) 
     , (2363477265938980864, 'beginTask', 'Mike', '2017-08-14 10:53:45.631', 'TaskThree', 145030476) 
     , (2363477390635986944, 'beginTask', 'Mike', '2017-08-14 10:54:44.706', 'TaskThree', 145030476) 
     , (2363477573427560448, 'beginTask', 'Mike', '2017-08-14 10:55:17.231', 'TaskThree', 145030476) 
     , (2363474885214375936, 'beginTask', 'John', '2017-08-14 10:44:44.702', 'TaskTwo', 145031392) 
     , (2363474985177161728, 'beginTask', 'John', '2017-08-14 10:45:31.133', 'TaskTwo', 145031392) 
     , (2363475195014119424, 'beginTask', 'John', '2017-08-14 10:46:10.098', 'TaskTwo', 145031392) 
     , (2363475342184120320, 'beginTask', 'John', '2017-08-14 10:46:45.357', 'TaskTwo', 145031392) 
     , (2363475473616953344, 'beginTask', 'John', '2017-08-14 10:47:17.911', 'TaskOne', 145031392) 
     , (2363475654437494784, 'beginTask', 'John', '2017-08-14 10:47:47.681', 'TaskThree', 145031392) 
     , (2363475771776864256, 'beginTask', 'John', '2017-08-14 10:48:27.1', 'TaskTwo', 145031392) 
     , (2363476006456762368, 'beginTask', 'John', '2017-08-14 10:49:06.151', 'TaskThree', 145031392) 
    ; 

With this data, here is the result I would like to get:

userName  taskName   numFailuresBeforeFirstSuccess
John      TaskOne    3
John      TaskTwo    0
John      TaskThree  1
Mike      TaskOne    0
Mike      TaskTwo    0
Mike      TaskThree  3
+0

Based on your data, shouldn't there be 4 failures before the first success for 'TaskOne'? –

+0

@RodrickChapman You're right, that was a typo. Edited. –

+0

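Going through your data, this is what I see, in order. John - Task 1 = 3 fails, 1 success, 1 fail, 1 success, 1 success. Task 2 = 1 success, 2 fails, 1 success, 1 success. Task 3 = 1 fail, 1 success. Mike - Task 1 = 1 success, Task 2 = 1 fail, 1 success, 1 fail, 1 success, Task 3 = 2 fails, 1 success, 1 fail, 1 success. – Shawn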

Answers

1

Here is one approach:

select e.username, e.taskname, 
     sum(case when timestamp < first_success_ts and e.eventname = 'fail' then 1 else 0 end) as numFailuresBeforeSuccess 
from (select e.*, 
      min(case when e.eventname = 'success' then e.timestamp end) over 
       (partition by e.username, e.taskname) as first_success_ts 
     from events e 
    ) e 
group by e.username, e.taskname 
order by e.username, e.taskname; 

This uses a window function to compute the time of the first success. It should work in both databases (at least in SQL Server 2012+).

+0

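Hmm... when I run this query, all the results are 0. The inner select seems to return the correct minimum timestamp, so maybe there is a quirk in how Vertica compares timestamps? Searching the documentation now... –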

+0

That is strange. Try forcing it to a 'timestamp

+0

The 'order by e.timestamp' in the inner select seems to cause rows where first_success_ts is null. Removing it makes the query work, but I don't know why. Do you? –
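A likely explanation for the nulls: in standard SQL, adding ORDER BY inside OVER (...) without an explicit frame makes the default frame run from the start of the partition up to the current row, so MIN becomes a running minimum. Rows that sort before the first success then see no success at all, get NULL, and the comparison against NULL never counts them. The sketch below reproduces this with a few of John's TaskOne rows. SQLite (3.25+) driven from Python stands in for Vertica here, which is an assumption, and the column is named `ts` instead of `timestamp` for convenience.

```python
import sqlite3

# SQLite (3.25+ for window functions) stands in for Vertica here; that swap is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (eventName TEXT, userName TEXT, ts TEXT, taskName TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        # A few of John's TaskOne rows from the question, deliberately out of order.
        ("success", "John", "2017-08-14 09:49:32.471", "TaskOne"),
        ("fail", "John", "2017-08-14 09:47:37.349", "TaskOne"),
        ("fail", "John", "2017-08-14 09:48:23.024", "TaskOne"),
        ("fail", "John", "2017-08-14 09:50:10.726", "TaskOne"),
    ],
)

# {window_order} is filled with either nothing or "ORDER BY ts".
QUERY = """
SELECT eventName,
       MIN(CASE WHEN eventName = 'success' THEN ts END)
           OVER (PARTITION BY userName, taskName {window_order}) AS first_success_ts
FROM events
ORDER BY ts
"""

# No ORDER BY in the window: the frame is the whole partition.
whole_partition = conn.execute(QUERY.format(window_order="")).fetchall()
# ORDER BY in the window: the default frame ends at the current row (a running MIN).
running = conn.execute(QUERY.format(window_order="ORDER BY ts")).fetchall()

print(whole_partition)
print(running)
```

In the running version, the two failures that precede the first success carry a NULL `first_success_ts`, which would explain a result of all zeros. If an ORDER BY is needed in the window for some other reason, spelling out the frame as `ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING` should restore the whole-partition behaviour.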

0

This query:

with F as 
(
    select * from Evts where eventName = 'fail' 
), 

S as 
(
    select * from Evts E 
     cross apply 
     (
      select count(F.eventID) numFailuresBeforeFirstSuccess from F 
       where F.userName = E.userName and 
         E.taskName = F.taskName and 
         F.timestamp < E.timestamp 
     ) K 

    where eventName = 'success' 
) 

select userName, taskName, numFailuresBeforeFirstSuccess from  
    (select *, row_number() over (partition by userName, taskName order by [timestamp] desc) o from S) S 
     where o = 1 

produces this result:

userName taskName numFailuresBeforeFirstSuccess 
----------- ----------- ----------------------------- 
John  TaskOne  4 
John  TaskThree 1 
John  TaskTwo  2 
Mike  TaskOne  0 
Mike  TaskThree 3 
Mike  TaskTwo  0 

The previous explanation applies here.

Rextester Demo

+0

It seems Vertica does not support the APPLY statement. I couldn't find it in the documentation, and there is an (unanswered) question about an equivalent of CROSS APPLY in Vertica: https://forum.vertica.com/discussion/238031/ –

+0

@AkiKanerva - It looks like Vertica does not support 'CROSS APPLY': https://my.vertica.com/get-started-vertica/working-with-joins/, but the syntax is different. I think we just need to figure out how to do the subquery in Vertica. –

+0

Hmm... I still get 'ERROR: Syntax error at or near "OUTER"' with your query. –
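Since APPLY is the sticking point, one possible way around it is a plain LEFT JOIN from the first success per user and task to the earlier failures. The sketch below is not the answerer's query. Note that the query above picks the latest success (`order by [timestamp] desc`), whereas this version takes the earliest one via `MIN`. SQLite driven from Python stands in for Vertica (an assumption), and `ts` replaces the `timestamp` column name. The data is Mike's rows from the question.

```python
import sqlite3

# SQLite stands in for Vertica here; that swap is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (eventName TEXT, userName TEXT, ts TEXT, taskName TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        # A subset of Mike's rows from the question, in insertion order.
        ("success", "Mike", "2017-08-14 10:51:06.841", "TaskOne"),
        ("success", "Mike", "2017-08-14 10:51:41.086", "TaskTwo"),
        ("success", "Mike", "2017-08-14 10:52:53.174", "TaskTwo"),
        ("success", "Mike", "2017-08-14 10:54:32.053", "TaskThree"),
        ("success", "Mike", "2017-08-14 10:56:01.783", "TaskThree"),
        ("fail", "Mike", "2017-08-14 10:52:21.312", "TaskThree"),
        ("fail", "Mike", "2017-08-14 10:53:28.906", "TaskThree"),
        ("fail", "Mike", "2017-08-14 10:55:05.499", "TaskThree"),
        ("beginTask", "Mike", "2017-08-14 10:50:50.074", "TaskOne"),
        ("beginTask", "Mike", "2017-08-14 10:51:54.829", "TaskThree"),
    ],
)

QUERY = """
SELECT s.userName, s.taskName, COUNT(f.ts) AS numFailuresBeforeFirstSuccess
FROM (
    -- earliest success per user and task; independent of row order
    SELECT userName, taskName, MIN(ts) AS first_success_ts
    FROM events
    WHERE eventName = 'success'
    GROUP BY userName, taskName
) s
LEFT JOIN events f
       ON f.userName = s.userName
      AND f.taskName = s.taskName
      AND f.eventName = 'fail'
      AND f.ts < s.first_success_ts
GROUP BY s.userName, s.taskName
ORDER BY s.userName, s.taskName
"""

result = conn.execute(QUERY).fetchall()
print(result)
```

`COUNT(f.ts)` counts only matched failure rows, so tasks with no earlier failure come out as 0 rather than disappearing. On the rows as posted, Mike's TaskThree has two failures before its first success at 10:54:32.053, so this sketch returns 2 where the expected table in the question lists 3.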

1

Again, this is T-SQL rather than Vertica SQL, but it is fairly standard SQL, as long as Vertica supports CTEs.

; WITH cte1 AS (
    SELECT t1.userName, t1.taskName, t1.ts 
     , LAG(t1.ts) OVER (PARTITION BY t1.userName, t1.taskName ORDER BY t1.ts) AS PreviousTS 
     , ROW_NUMBER() OVER (PARTITION BY t1.userName ORDER BY t1.ts) AS rn 
    FROM #taskevents t1 
    WHERE t1.eventName = 'success' -- lowercase to match the data; matters where string comparison is case-sensitive
) 
SELECT s1.userName, s1.taskName, AVG(s1.failCount) AS avgFailCount 
FROM (
    SELECT cte1.userName, cte1.taskName , cte1.rn, COALESCE(COUNT(t2.ts),0) AS failCount 
    FROM cte1 
    LEFT OUTER JOIN #taskevents t2 ON t2.userName = cte1.userName 
     AND t2.taskName = cte1.taskName 
     AND t2.ts < cte1.ts 
     AND (t2.ts >= cte1.PreviousTS OR cte1.PreviousTS IS NULL) 
     AND t2.eventName = 'fail' 
    GROUP BY cte1.userName, cte1.taskName, cte1.rn 
) s1 
GROUP BY s1.userName, s1.taskName 
ORDER BY s1.userName, s1.taskName 

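This gives you the averages. Remove the outer query to see the data I'm working from. It produces slightly different results from the ones you listed, but it should give the appropriate averages for what you said you wanted. Let me know if I have misunderstood the requirements.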

Note: in my test data I also added two individuals who failed but never succeeded, just to verify that they are excluded from the results.

, (2363476006456762398, 'fail', 'Steve', '2017-08-14 11:29:06.151', 'Task42', 145031342) 
, (2363476046456762368, 'fail', 'Joe', '2017-08-14 11:49:06.151', 'Task42', 145031399) 

=====================================

Results

----------------------------------- 
|userName| taskName |avgFailCount| 
----------------------------------- 
| John | TaskOne |  1  | 
| John | TaskThree |  1  | 
| John | TaskTwo |  0  | 
| Mike | TaskOne |  0  | 
| Mike | TaskThree |  1  | 
| Mike | TaskTwo |  0  | 
----------------------------------- 

=====================================

EDIT: For the average by task only:

; WITH cte1 AS (
    SELECT t1.userName, t1.taskName, t1.ts 
     , LAG(t1.ts) OVER (PARTITION BY t1.userName, t1.taskName ORDER BY t1.ts) AS PreviousTS 
     , ROW_NUMBER() OVER (PARTITION BY t1.userName ORDER BY t1.ts) AS rn 
    FROM #taskevents t1 
    WHERE t1.eventName = 'success' -- lowercase to match the data; matters where string comparison is case-sensitive
) 
SELECT s1.taskName 
    , AVG(CAST(s1.failCount AS decimal(5,2))) AS avgFailCount 
FROM (
    SELECT cte1.userName, cte1.taskName , cte1.rn, COALESCE(COUNT(t2.ts),0) AS failCount 
    FROM cte1 
    LEFT OUTER JOIN #taskevents t2 ON t2.userName = cte1.userName 
     AND t2.taskName = cte1.taskName 
     AND t2.ts < cte1.ts 
     AND (t2.ts >= cte1.PreviousTS OR cte1.PreviousTS IS NULL) 
     AND t2.eventName = 'fail' 
    GROUP BY cte1.userName, cte1.taskName, cte1.rn 
) s1 
GROUP BY s1.taskName 
ORDER BY s1.taskName 

gives you

-------------------------- 
| taskName |avgFailCount| 
-------------------------- 
| TaskOne | 1.000000 | 
| TaskThree | 1.333333 | 
| TaskTwo | 0.400000 | 
-------------------------- 

which is basically

SELECT (3+1+0+0)/4.0 AS TaskOne 
SELECT (0+2+0+0+0)/5.0 AS TaskTwo 
SELECT (1+2+1)/3.0 AS TaskThree 

derived from the following data points.

-------------------------------- 
|userName| taskName |FailCount| 
-------------------------------- 
| John | TaskOne | 3 | 
| John | TaskOne | 1 | 
| John | TaskOne | 0 | 
| Mike | TaskOne | 0 | 
| John | TaskTwo | 0 | 
| John | TaskTwo | 2 | 
| John | TaskTwo | 0 | 
| Mike | TaskTwo | 0 | 
| Mike | TaskTwo | 0 | 
| John | TaskThree | 1 | 
| Mike | TaskThree | 2 | 
| Mike | TaskThree | 1 | 
-------------------------------- 

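This is the average number of failures before a success, not the average number of failures per attempt. That would be somewhat different.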

--------------------------------------------------- 
| task | fails | attempts | avg fails per attempt | 
--------------------------------------------------- 
| Task1| 4 | 8  | 4/8 = 0.500000  | 
| Task2| 2 | 7  | 2/7 = 0.285714  | 
| Task3| 3 | 7  | 3/7 = 0.428571  | 
--------------------------------------------------- 
+0

FYI: #taskevents is the temp table name for my events table. – Shawn

+0

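The syntax seems to work in Vertica, and the results look good. The average I am looking for is actually "average attempts until first success, per task", not per user. So if, in TaskOne, John had 3 failures before succeeding and Mike had 0, the average would be 1.5. Looking at my original wording, though, that wasn't very clear. –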

+1

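If you remove t1.userName from the SELECT, GROUP BY and ORDER BY, it will give you the average you want. Also CAST s1.failCount to a decimal data type, so that the result comes out as a decimal average. – Shawn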
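For the per-task figure described in the comments (John with 3 failures and Mike with 0 on TaskOne giving 1.5), one possible shape is to compute the failures before each user's first success and then average those counts per task. This is a minimal sketch rather than the query from the answer above. SQLite driven from Python stands in for Vertica (an assumption), `ts` replaces the `timestamp` column name, and the Steve row mirrors the "failed but never succeeded" test case mentioned in the answer.

```python
import sqlite3

# SQLite stands in for Vertica here; that swap is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (eventName TEXT, userName TEXT, ts TEXT, taskName TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("success", "John", "2017-08-14 09:49:32.471", "TaskOne"),
        ("success", "John", "2017-08-14 09:50:48.781", "TaskOne"),
        ("fail", "John", "2017-08-14 09:47:37.349", "TaskOne"),
        ("fail", "John", "2017-08-14 09:48:23.024", "TaskOne"),
        ("fail", "John", "2017-08-14 09:48:58.646", "TaskOne"),
        ("fail", "John", "2017-08-14 09:50:10.726", "TaskOne"),
        ("beginTask", "Mike", "2017-08-14 10:50:50.074", "TaskOne"),
        ("success", "Mike", "2017-08-14 10:51:06.841", "TaskOne"),
        # A user who never succeeds; should not contribute to any average.
        ("fail", "Steve", "2017-08-14 11:29:06.151", "Task42"),
    ],
)

QUERY = """
SELECT taskName,
       AVG(CAST(numFailures AS FLOAT)) AS avgFailuresBeforeFirstSuccess
FROM (
    SELECT userName, taskName,
           SUM(CASE WHEN eventName = 'fail' AND ts < first_success_ts
                    THEN 1 ELSE 0 END) AS numFailures
    FROM (
        SELECT e.*,
               MIN(CASE WHEN e.eventName = 'success' THEN e.ts END)
                   OVER (PARTITION BY e.userName, e.taskName) AS first_success_ts
        FROM events e
    ) w
    WHERE first_success_ts IS NOT NULL  -- drop user/task pairs with no success
    GROUP BY userName, taskName
) per_user
GROUP BY taskName
ORDER BY taskName
"""

result = conn.execute(QUERY).fetchall()
print(result)
```

The cast follows the suggestion in the comment above: without it, an engine that averages integers as integers would truncate 1.5 to 1.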
