2014-04-29 51 views
2

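Grouping distinct rows in Google BigQuery takes much longer

I am running a funnel analysis on the publicdata:samples.github_timeline dataset in Google BigQuery. I want to extract all unique users who performed a sequence of three events in chronological order.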

The events, in order:

  • WatchEvent
  • PushEvent
  • CreateEvent

Here is the query:

SELECT user FROM (
  SELECT user1 AS user,
         ts1 AS eventDate1,
         ts2 AS eventDate2,
         IF(ts2 < ts3, ts3, NULL) AS eventDate3
  FROM (
    SELECT user1, ts1, ts2, ts3
    FROM (
      SELECT user1,
             ts1,
             IF(ts1 < ts2, ts2, NULL) AS ts2
      FROM (
        SELECT user1, ts1, ts2
        FROM (
          SELECT repository_owner AS user1,
                 created_at AS ts1
          FROM [publicdata:samples.github_timeline]
          WHERE type = "WatchEvent") AS step1
        LEFT JOIN EACH (
          SELECT repository_owner AS user2,
                 created_at AS ts2
          FROM [publicdata:samples.github_timeline]
          WHERE type = "PushEvent") AS step2
        ON user1 = user2
        WHERE ts1 IS NOT NULL)
    ) AS steps1_2
    LEFT JOIN (
      SELECT repository_owner AS user3,
             created_at AS ts3
      FROM [publicdata:samples.github_timeline]
      WHERE type = "CreateEvent") AS step3
    ON user1 = user3
    WHERE ts2 IS NOT NULL
  )
)
WHERE eventDate3 IS NOT NULL
GROUP BY user
LIMIT 100

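Without the GROUP BY user at the end, the query is very fast (10 seconds). But when I add it, the query takes a long time to complete (more than 20 minutes).

What is wrong with the query? You can test the query here: https://bigquery.cloud.google.com/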

Answers

1

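If you use "limit 100" in a non-grouped query, the orchestrator stops execution as soon as it has fetched the first 100 rows.

"group by user limit 100" requires every row to be computed before grouping. The grouping is performed next, and only then does "limit 100" take effect.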
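
As a rough analogy for the point above, here is a small Python sketch. It does not model BigQuery's actual execution engine; the row source and the names `rows`, `limit_only`, and `group_then_limit` are invented for illustration. A plain limit can stop reading after 100 rows, while grouping has to consume every row before the limit applies.

```python
from itertools import islice


def rows(counter, n=100_000):
    """Hypothetical row source: yields (user, ts) pairs and counts rows read."""
    for i in range(n):
        counter["read"] += 1
        yield (f"user{i % 500}", i)


def limit_only(limit=100):
    """LIMIT without GROUP BY: stop pulling rows once `limit` rows are in hand."""
    counter = {"read": 0}
    result = list(islice(rows(counter), limit))
    return result, counter["read"]


def group_then_limit(limit=100):
    """GROUP BY then LIMIT: every row must be read before any group is final."""
    counter = {"read": 0}
    first_seen = {}
    for user, ts in rows(counter):
        first_seen.setdefault(user, ts)
    result = list(islice(first_seen, limit))
    return result, counter["read"]
```

Calling `limit_only()` reads only as many rows as it returns, whereas `group_then_limit()` reads the whole input even though it also returns 100 users.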

+0

I tried it without the limit 100 clause, and it took 23 minutes.

3

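You have a join explosion; that is, if user A has 20 WatchEvents, 20 PushEvents and 20 CreateEvents, your query can generate 8000 rows from those 60. This happens because, when both sides of a JOIN have multiple rows with the same key, the JOIN produces the Cartesian product of the two sides. You can fix this by keeping only the minimum matching time at each step: take the minimum WatchEvent time per user and use it to find later PushEvent times, then take the minimum PushEvent time that is later than the WatchEvent time and use it to find matching CreateEvent times.

Here is a query that runs in about 20 seconds: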
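
The 20 x 20 x 20 arithmetic above can be checked with a small sketch. It uses Python's built-in `sqlite3` module rather than BigQuery, and the table name `events`, the column `owner`, and the integer timestamps are assumptions made only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (owner TEXT, type TEXT, ts INTEGER)")

# Hypothetical data: user A has 20 events of each type, timestamps 1..60.
data = []
ts = 0
for event_type in ("WatchEvent", "PushEvent", "CreateEvent"):
    for _ in range(20):
        ts += 1
        data.append(("A", event_type, ts))
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", data)

# Naive funnel: every WatchEvent joins every PushEvent and every CreateEvent.
naive_rows = conn.execute("""
    SELECT COUNT(*)
    FROM events w
    JOIN events p ON p.owner = w.owner AND p.type = 'PushEvent'
    JOIN events c ON c.owner = w.owner AND c.type = 'CreateEvent'
    WHERE w.type = 'WatchEvent'
""").fetchone()[0]

# Reduced funnel: keep only the earliest qualifying timestamp at each step,
# so each join has a single row per user on its left side.
reduced = conn.execute("""
    SELECT s2.owner, s2.ts1, s2.ts2, MIN(c.ts) AS ts3
    FROM (
        SELECT s1.owner AS owner, s1.ts1 AS ts1, MIN(p.ts) AS ts2
        FROM (
            SELECT owner, MIN(ts) AS ts1
            FROM events
            WHERE type = 'WatchEvent'
            GROUP BY owner
        ) s1
        JOIN events p
          ON p.owner = s1.owner AND p.type = 'PushEvent' AND p.ts > s1.ts1
        GROUP BY s1.owner, s1.ts1
    ) s2
    JOIN events c
      ON c.owner = s2.owner AND c.type = 'CreateEvent' AND c.ts > s2.ts2
    GROUP BY s2.owner, s2.ts1, s2.ts2
""").fetchall()
```

Here `naive_rows` counts the intermediate rows the three-way join builds for a single user, while `reduced` collapses that user to one row holding the earliest timestamp of each step.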

SELECT user
FROM (
  SELECT step2_2.user1 AS user,
         MIN(step2_2.ts1) AS eventDate1,
         MIN(step2_2.ts2) AS eventDate2,
         MIN(step3.ts3) AS eventDate3
  FROM (
    SELECT user1, MIN(ts1) AS ts1, MIN(ts2) AS ts2
    FROM (
      SELECT repository_owner AS user1,
             MIN(created_at) AS ts1
      FROM [publicdata:samples.github_timeline]
      WHERE type = "WatchEvent"
      GROUP EACH BY user1) AS step1
    JOIN EACH (
      SELECT repository_owner AS user2,
             created_at AS ts2
      FROM [publicdata:samples.github_timeline]
      WHERE type = "PushEvent") AS step2
    ON user1 = user2
    WHERE ts1 < ts2
    GROUP EACH BY user1
  ) AS step2_2
  JOIN EACH (
    SELECT repository_owner AS user3,
           created_at AS ts3
    FROM [publicdata:samples.github_timeline]
    WHERE type = "CreateEvent") AS step3
  ON user1 = user3
  WHERE step2_2.ts2 < step3.ts3
  GROUP EACH BY user
)
GROUP BY user
LIMIT 100

2

If your dataset is not too large, you can use the LEAD() window function to find the sequence and avoid joins entirely.

SELECT repository_owner
FROM (
  SELECT repository_owner,
         x AS Event0,
         LEAD(x, 1) OVER (PARTITION BY repository_owner ORDER BY ts) AS Event1,
         LEAD(x, 2) OVER (PARTITION BY repository_owner ORDER BY ts) AS Event2
  FROM (
    SELECT repository_owner, created_at AS ts, type AS x
    FROM [publicdata:samples.github_timeline]
    WHERE type IN ("WatchEvent", "PushEvent", "CreateEvent")
  )
)
WHERE Event0 = "WatchEvent"
  AND Event1 = "PushEvent"
  AND Event2 = "CreateEvent"
GROUP BY repository_owner

7 seconds...
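
To make the LEAD() idea concrete, here is a minimal Python sketch of the same back-to-back check on an in-memory list. The helper names `lead` and `back_to_back_users` and the sample users are hypothetical; the sketch mirrors the logic of the query, not BigQuery's implementation.

```python
from collections import defaultdict

FUNNEL = ("WatchEvent", "PushEvent", "CreateEvent")


def lead(values, offset):
    """Shift `values` forward by `offset`, padding with None (like LEAD(x, offset))."""
    return values[offset:] + [None] * min(offset, len(values))


def back_to_back_users(events, funnel=FUNNEL):
    """events: iterable of (user, ts, type) tuples.

    Returns the users whose funnel events occur consecutively, after
    discarding event types that are not part of the funnel.
    """
    per_user = defaultdict(list)
    # Equivalent of PARTITION BY user ORDER BY ts.
    for user, ts, event_type in sorted(events, key=lambda e: (e[0], e[1])):
        if event_type in funnel:
            per_user[user].append(event_type)

    matched = set()
    for user, types in per_user.items():
        for triple in zip(types, lead(types, 1), lead(types, 2)):
            if triple == tuple(funnel):
                matched.add(user)
    return matched


sample = [
    ("alice", 1, "WatchEvent"),
    ("alice", 2, "PushEvent"),
    ("alice", 3, "CreateEvent"),
    ("bob", 1, "WatchEvent"),
    ("bob", 2, "PushEvent"),
    ("bob", 3, "WatchEvent"),
    ("bob", 4, "CreateEvent"),
]
```

With this sample, `back_to_back_users(sample)` keeps alice but not bob, because bob's second WatchEvent sits between his PushEvent and CreateEvent.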

If the events are not in back-to-back order (see Jordan's comment), the query needs to be a little more complicated:

SELECT repository_owner
FROM (
  SELECT repository_owner, Event0, Event1,
         LEAD(Event0, 1) OVER (PARTITION BY repository_owner ORDER BY ts) AS Event2,
         LEAD(Event1, 1) OVER (PARTITION BY repository_owner ORDER BY ts) AS Event3
  FROM (
    SELECT *
    FROM (
      SELECT repository_owner, x AS Event0, ts,
             LEAD(x, 1) OVER (PARTITION BY repository_owner ORDER BY ts) AS Event1
      FROM (
        SELECT repository_owner, created_at AS ts, type AS x
        FROM [publicdata:samples.github_timeline]
        WHERE type IN ("WatchEvent", "PushEvent", "CreateEvent")
      )
    )
    WHERE (Event0 = "WatchEvent" AND Event1 IN ("PushEvent", "CreateEvent"))
       OR (Event1 = "CreateEvent" AND Event0 IN ("PushEvent", "WatchEvent"))
  )
)
WHERE Event0 = "WatchEvent"
  AND (Event1 = "PushEvent" OR Event2 = "PushEvent")
  AND Event3 = "CreateEvent"
GROUP BY repository_owner
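
As a plain-Python reference for what "in order, but not necessarily back to back" means, the sketch below checks whether the funnel occurs as an ordered subsequence of one user's events. It is not a translation of the query above; the function name `completes_funnel` is hypothetical, and the input is assumed to be already sorted by timestamp.

```python
FUNNEL = ("WatchEvent", "PushEvent", "CreateEvent")


def completes_funnel(types, funnel=FUNNEL):
    """Return True if `funnel` is an ordered subsequence of `types`.

    `types` is one user's event types, assumed sorted by timestamp.
    Other events may appear between the funnel steps.
    """
    step = 0
    for event_type in types:
        if event_type == funnel[step]:
            step += 1
            if step == len(funnel):
                return True
    return False
```

For example, `completes_funnel(["WatchEvent", "PushEvent", "WatchEvent", "CreateEvent"])` is true, while the same events with the PushEvent first and no later PushEvent would not complete the funnel.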

If the dataset is too large, you will hit this issue: Parallelizable OVER EACH BY

Hope this helps.

+0

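I think this solution is more elegant than the one I suggested, but doesn't it only work when the events occur back to back in this order? For example, with WatchEvent/PushEvent/WatchEvent/CreateEvent, wouldn't this fail to match?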

+0

I edited my answer to add a solution for the scenario you mentioned.