2017-05-30 10 views

I have a BigQuery table called [games] (about 50,000 rows) in which rows repeat each other within seconds. The rows have the following structure:

user_id game_id game_play_time 
1234567 3444432 2017-05-30 15:26:57 UTC 
1234567 3444432 2017-05-30 15:26:58 UTC 
1234567 3444432 2017-05-30 15:26:59 UTC 
9876544 8586588 2017-05-30 23:26:11 UTC 
4638889 8698798 2017-05-30 15:26:58 UTC 
4638889 8698798 2017-05-30 15:27:58 UTC 

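I need to delete rows that have the same user_id and game_id when the time difference between consecutive plays is 1 second or less (keeping the first occurrence).

The result should be as follows: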

user_id game_id game_play_time 
1234567 3444432 2017-05-30 15:26:57 UTC 
9876544 8586588 2017-05-30 23:26:11 UTC 
4638889 8698798 2017-05-30 15:26:58 UTC 
4638889 8698798 2017-05-30 15:27:58 UTC 

Based on the input data in your question, please show the expected output! –

@MikhailBerlyant - I have just added the output – BrunoGG

Answers

1

Below is a query for BigQuery Standard SQL:

#standardSQL
SELECT
    user_id,
    game_id,
    MIN(game_play_time) AS game_play_time
FROM (
    SELECT
        user_id,
        game_id,
        game_play_time,
        -- Running total of step: rows in the same run of near-duplicates share one grp value
        SUM(step) OVER(PARTITION BY user_id, game_id ORDER BY game_play_time) AS grp
    FROM (
        SELECT
            user_id,
            game_id,
            game_play_time,
            -- step = 1 when more than 1 second has passed since the previous play, else 0
            CASE WHEN IFNULL(TIMESTAMP_DIFF(game_play_time, LAG(game_play_time) OVER(PARTITION BY user_id, game_id ORDER BY game_play_time), SECOND), 0) > 1 THEN 1 ELSE 0 END AS step
        FROM YourTable
    )
)
GROUP BY user_id, game_id, grp
-- ORDER BY user_id, game_id, grp

You can test it with the dummy data below (the example from your question plus a few extra rows to make it more general):

#standardSQL
WITH YourTable AS (
    SELECT '1234567' AS user_id, '3444432' AS game_id, TIMESTAMP('2017-05-30 12:26:57') game_play_time UNION ALL
    SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 12:26:57') UNION ALL
    SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 13:26:57') UNION ALL
    SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 13:26:57') UNION ALL
    SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 15:26:57') UNION ALL
    SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 15:26:57') UNION ALL
    SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 15:26:58') UNION ALL
    SELECT '1234567', '3444432', TIMESTAMP('2017-05-30 15:26:59') UNION ALL
    SELECT '9876544', '8586588', TIMESTAMP('2017-05-30 23:26:11') UNION ALL
    SELECT '4638889', '8698798', TIMESTAMP('2017-05-30 15:26:58') UNION ALL
    SELECT '4638889', '8698798', TIMESTAMP('2017-05-30 15:27:58')
)
SELECT
    user_id,
    game_id,
    MIN(game_play_time) AS game_play_time
FROM (
    SELECT
        user_id,
        game_id,
        game_play_time,
        SUM(step) OVER(PARTITION BY user_id, game_id ORDER BY game_play_time) AS grp
    FROM (
        SELECT
            user_id,
            game_id,
            game_play_time,
            CASE WHEN IFNULL(TIMESTAMP_DIFF(game_play_time, LAG(game_play_time) OVER(PARTITION BY user_id, game_id ORDER BY game_play_time), SECOND), 0) > 1 THEN 1 ELSE 0 END AS step
        FROM YourTable
    )
)
GROUP BY user_id, game_id, grp
-- ORDER BY user_id, game_id, grp
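
The step/grp logic of the query can also be traced outside BigQuery. The following is a minimal Python sketch, not part of the original answer: the name dedupe_plays and the tuple layout are made up for illustration, and it assumes whole-second timestamps like those in the question. A row is kept when it is the first play for its user_id and game_id, or when more than one second has passed since the previous play, which is where each grp in the query begins.

```python
from datetime import datetime
from itertools import groupby


def dedupe_plays(rows, max_gap_seconds=1):
    """Keep the first play of every run of near-duplicate plays.

    rows: iterable of (user_id, game_id, game_play_time) tuples (hypothetical layout).
    """
    kept = []
    ordered = sorted(rows)  # orders by user_id, game_id, then game_play_time
    for (user_id, game_id), plays in groupby(ordered, key=lambda r: (r[0], r[1])):
        previous = None
        for _, _, played_at in plays:
            # Same test as "step" in the query, plus the first row of each partition
            if previous is None or (played_at - previous).total_seconds() > max_gap_seconds:
                kept.append((user_id, game_id, played_at))
            # Always compare against the immediately preceding play, like LAG()
            previous = played_at
    return kept


# The six rows from the question
SAMPLE_ROWS = [
    ("1234567", "3444432", datetime(2017, 5, 30, 15, 26, 57)),
    ("1234567", "3444432", datetime(2017, 5, 30, 15, 26, 58)),
    ("1234567", "3444432", datetime(2017, 5, 30, 15, 26, 59)),
    ("9876544", "8586588", datetime(2017, 5, 30, 23, 26, 11)),
    ("4638889", "8698798", datetime(2017, 5, 30, 15, 26, 58)),
    ("4638889", "8698798", datetime(2017, 5, 30, 15, 27, 58)),
]

for row in dedupe_plays(SAMPLE_ROWS):
    print(row[0], row[1], row[2])
```

On the six rows from the question this keeps four rows, the same ones listed in the expected output (printed in user_id order rather than input order).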

@BrunoGG - did you get a chance to try it? –

Thank you :) Sorry for the late reply. – BrunoGG

2

Does this work for you?

SELECT
    user_id,
    game_id,
    MIN(game_play_time) game_play_time
FROM (
    SELECT
        user_id,
        game_id,
        game_play_time,
        lead_time,
        (UNIX_SECONDS(lead_time) - UNIX_SECONDS(game_play_time) <= 1) diff
    FROM (
        SELECT
            user_id,
            game_id,
            game_play_time,
            LEAD(game_play_time, 1) OVER(PARTITION BY user_id, game_id ORDER BY game_play_time) lead_time
        FROM data
    )
)
GROUP BY user_id, game_id, diff
ORDER BY user_id, game_id, game_play_time

where data is the input data, which I defined like this:

WITH data AS(
select '1234567' as user_id, '3444432' as game_id, timestamp('2017-05-30 15:26:57') game_play_time union all 
select '1234567' as user_id, '3444432' as game_id, timestamp('2017-05-30 15:26:58') game_play_time union all 
select '1234567' as user_id, '3444432' as game_id, timestamp('2017-05-30 15:26:59') game_play_time union all 
select '9876544' as user_id, '8586588' as game_id, timestamp('2017-05-30 23:26:11') game_play_time union all 
select '4638889' as user_id, '8698798' as game_id, timestamp('2017-05-30 15:26:58') game_play_time union all 
select '4638889' as user_id, '8698798' as game_id, timestamp('2017-05-30 15:27:58') game_play_time 
) 

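Even though it seems to work here, I am not sure whether there are still corner cases where it won't work. Running it against your real data may show whether everything is fine.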
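
One such corner case can be traced by hand. The sketch below is a Python emulation of the query above, not BigQuery itself; the name emulate_lead_query is hypothetical, and it assumes that NULL values of diff form their own group under GROUP BY, as in standard SQL. Every row whose next play is more than one second away gets diff = false, so all such rows for a user and game fall into a single group and MIN keeps only the earliest of them.

```python
from datetime import datetime, timedelta
from itertools import groupby


def emulate_lead_query(rows):
    """Emulate LEAD(), diff = (lead - current <= 1s), GROUP BY user_id, game_id, diff.

    rows: iterable of (user_id, game_id, game_play_time) tuples (hypothetical layout).
    """
    earliest = {}  # (user_id, game_id, diff) -> MIN(game_play_time)
    for (user_id, game_id), plays in groupby(sorted(rows), key=lambda r: (r[0], r[1])):
        times = [played_at for _, _, played_at in plays]
        for i, played_at in enumerate(times):
            lead_time = times[i + 1] if i + 1 < len(times) else None
            # diff is None (NULL) for the last play of each user/game pair
            diff = None if lead_time is None else (lead_time - played_at).total_seconds() <= 1
            key = (user_id, game_id, diff)
            if key not in earliest or played_at < earliest[key]:
                earliest[key] = played_at
    return sorted((u, g, t) for (u, g, _), t in earliest.items())


start = datetime(2017, 5, 30, 15, 0, 0)
# Three plays one minute apart: none are duplicates, so all three should survive.
spaced_rows = [("u1", "g1", start + timedelta(minutes=m)) for m in range(3)]
kept = emulate_lead_query(spaced_rows)
print([t.strftime("%H:%M:%S") for _, _, t in kept])  # the 15:01:00 play is missing
```

Under the same assumption, the 15:26:57, 15:26:58, 15:26:59 run from the question comes back as both 15:26:57 and 15:26:59, because the last row has a NULL diff and lands in a separate group.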

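Thanks! The problem with using 'MIN(game_play_time)' is that it keeps only the first occurrence even when the rows are more than one second apart (rows we want to keep). Removing it, selecting plain 'game_play_time', and adding that column to the GROUP BY seems to work. I need to do more validation. – BrunoGG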

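Hi, could you give an example of the output you want? I thought you only wanted to keep the first occurrence, but maybe you have other rules for consecutive rows. –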