2016-10-03 76 views
1

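BigQuery - complex correlated query

I want to query the Google BigQuery public Reddit dataset. My goal is to compute the similarity of subreddits using the Jaccard index, which is defined as:

J(A, B) = |A ∩ B| / |A ∪ B|

where A and B are the sets of users (authors) who commented in each of the two subreddits.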
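To make the definition concrete, here is a minimal Python sketch of the Jaccard index on two small sets. The author names and the helper `jaccard_index` are made up for illustration; they are not part of the Reddit dataset or of any BigQuery API.

```python
def jaccard_index(a: set, b: set) -> float:
    """Return |a ∩ b| / |a ∪ b|; defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical author sets for two subreddits (illustrative data only).
art_authors = {"alice", "bob", "carol", "dave"}
science_authors = {"carol", "dave", "erin"}

# 2 shared authors out of 5 distinct authors overall.
print(jaccard_index(art_authors, science_authors))
```

The query therefore only needs two numbers per pair of subreddits: the size of the intersection and the size of the union of their author sets.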

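My plan is to select the top N = 1000 subreddits by number of comments in August 2016, then compute their Cartesian product to get every combination of subreddits as rows of the shape subreddit1, subreddit2.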

Then I would use those row combinations to query the union and the intersection of users between subreddit1 and subreddit2.
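The plan can be sketched in memory on toy data before worrying about how to express it in BigQuery. Everything below (the `comments` rows, `TOP_N = 3`) is a hypothetical stand-in for the real table and for N = 1000.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical (author, subreddit) comment rows; illustrative data only.
comments = [
    ("alice", "Art"), ("bob", "Art"), ("carol", "Art"),
    ("carol", "Science"), ("dave", "Science"), ("alice", "Science"),
    ("erin", "Politics"), ("alice", "Politics"),
    ("frank", "Obscure"),
]

TOP_N = 3  # the question uses N = 1000

# Step 1: rank subreddits by comment count and keep the top N.
comment_counts = Counter(subreddit for _, subreddit in comments)
top_subreddits = [name for name, _ in comment_counts.most_common(TOP_N)]

# Step 2: collect the distinct authors of each top subreddit.
authors_by_subreddit = defaultdict(set)
for author, subreddit in comments:
    if subreddit in top_subreddits:
        authors_by_subreddit[subreddit].add(author)

# Step 3: enumerate unordered pairs (the a.subreddit < b.subreddit half of
# the Cartesian product) and count the union and intersection of authors.
pair_stats = {}
for sub1, sub2 in combinations(sorted(top_subreddits), 2):
    a, b = authors_by_subreddit[sub1], authors_by_subreddit[sub2]
    pair_stats[(sub1, sub2)] = (len(a | b), len(a & b))

for (sub1, sub2), (union_size, intersection_size) in pair_stats.items():
    print(sub1, sub2, union_size, intersection_size)
```

Keeping only pairs with subreddit1 < subreddit2 halves the work and drops self-pairs, which is what the WHERE a.subreddit < b.subreddit filter does in the SQL.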

The query I have so far is this:

SELECT 
    subreddit1, 
    subreddit2, 
    (SELECT 
    COUNT(DISTINCT author) 
    FROM `fh-bigquery.reddit_comments.2016_08` 
    WHERE subreddit = subreddit1 
    OR subreddit = subreddit2 
    LIMIT 1 
) as subreddits_union, 

    (
    SELECT 
     COUNT(DISTINCT author) 
    FROM `fh-bigquery.reddit_comments.2016_08` 
    WHERE subreddit = subreddit1 
    AND author IN ( 
     SELECT author 
     FROM `fh-bigquery.reddit_comments.2016_08` 
     WHERE subreddit= subreddit2 
     GROUP BY author 
     ) 
    ) as subreddits_intersection 

FROM 

(SELECT a.subreddit as subreddit1, b.subreddit as subreddit2 
FROM (
    SELECT subreddit, count(*) as n_comments 
    FROM `fh-bigquery.reddit_comments.2016_08` 
    GROUP BY subreddit 
    ORDER BY n_comments DESC 
    LIMIT 1000 
    ) a 
CROSS JOIN (
    SELECT subreddit, count(*) as n_comments 
    FROM `fh-bigquery.reddit_comments.2016_08` 
    GROUP BY subreddit 
    ORDER BY n_comments DESC 
    LIMIT 1000 
    ) b 
WHERE a.subreddit < b.subreddit 
) 

Ideally, this would give results like:

subreddit1 | subreddit2 | subreddits_union | subreddits_intersection
-----------+------------+------------------+------------------------
Art        | Politics   |            50000 |                   21000
Art        | Science    |            92320 |                   15000
...        | ...        |              ... |                     ...

However, this query gives me the following BigQuery error: Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.

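I understand that. However, I don't think this query can be transformed into an efficient join. Given that BigQuery has no APPLY operator, is there any way to set up this query without resorting to individual queries? Maybe with PARTITION BY?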

Answers

1


Maybe something along these lines:

WITH top_most AS (
    SELECT subreddit, count(*) as n_comments 
    FROM `fh-bigquery.reddit_comments.2016_08` 
    GROUP BY subreddit 
    ORDER BY n_comments DESC 
    LIMIT 20 
), 
authors AS (
    SELECT DISTINCT author, subreddit 
    FROM `fh-bigquery.reddit_comments.2016_08` 
) 
SELECT 
count(DISTINCT a1.author) AS subreddits_intersection, 
subreddit1, subreddit2 
FROM 
(
    SELECT t1.subreddit subreddit1, t2.subreddit subreddit2 
    FROM top_most t1 CROSS JOIN top_most t2 LIMIT 1000000 
) 
INNER JOIN authors a1 on a1.subreddit = subreddit1 
INNER JOIN authors a2 on a2.subreddit = subreddit2 
WHERE a1.author = a2.author 
GROUP BY subreddit1, subreddit2 
ORDER BY subreddit1, subreddit2 
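The self-join idea can be checked on toy data. The sketch below runs a similar query in SQLite through Python's `sqlite3` module, purely as an assumption-laden stand-in for BigQuery: the `comments` table and its rows are hypothetical. It also derives the union from the intersection with the identity |A ∪ B| = |A| + |B| − |A ∩ B|. That step is not part of the answer above; it is one possible way to get both numbers from a single join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (author TEXT, subreddit TEXT)")
# Hypothetical rows; alice comments twice in Art to show why DISTINCT matters.
conn.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [
        ("alice", "Art"), ("alice", "Art"), ("bob", "Art"), ("carol", "Art"),
        ("carol", "Science"), ("dave", "Science"), ("alice", "Science"),
        ("erin", "Politics"), ("alice", "Politics"),
    ],
)

query = """
WITH authors AS (            -- one row per (author, subreddit)
    SELECT DISTINCT author, subreddit FROM comments
),
sizes AS (                   -- |A| for every subreddit
    SELECT subreddit, COUNT(*) AS n_authors FROM authors GROUP BY subreddit
),
intersections AS (           -- |A ∩ B| via a self-join on author
    SELECT a1.subreddit AS subreddit1,
           a2.subreddit AS subreddit2,
           COUNT(*) AS n_common
    FROM authors a1
    JOIN authors a2
      ON a1.author = a2.author AND a1.subreddit < a2.subreddit
    GROUP BY a1.subreddit, a2.subreddit
)
SELECT i.subreddit1, i.subreddit2, i.n_common,
       s1.n_authors + s2.n_authors - i.n_common AS n_union
FROM intersections i
JOIN sizes s1 ON s1.subreddit = i.subreddit1
JOIN sizes s2 ON s2.subreddit = i.subreddit2
ORDER BY i.subreddit1, i.subreddit2
"""

rows = conn.execute(query).fetchall()
jaccard = {(s1, s2): common / union for s1, s2, common, union in rows}
for pair, value in jaccard.items():
    print(pair, value)
```

As with the INNER JOIN in the answer, pairs that share no author produce no row at all, so a missing pair should be read as a Jaccard index of 0.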

Oh man, thank you so much! Your queries are all I needed, and they run super fast!

1

I'm not sure I fully understand what you are trying to compute, but maybe this example can help you work out a solution:

SELECT 
    subreddit1, 
    subreddit2, 
    COUNT(DISTINCT author) AS subreddits_union 
FROM 
`fh-bigquery.reddit_comments.2016_08` as f 
CROSS JOIN 
(SELECT a.subreddit as subreddit1, b.subreddit as subreddit2 
FROM (
    SELECT subreddit, count(*) as n_comments 
    FROM `fh-bigquery.reddit_comments.2016_08` 
    GROUP BY subreddit 
    ORDER BY n_comments DESC 
    LIMIT 10 
    ) a 
CROSS JOIN (
    SELECT subreddit, count(*) as n_comments 
    FROM `fh-bigquery.reddit_comments.2016_08` 
    GROUP BY subreddit 
    ORDER BY n_comments DESC 
    LIMIT 10 
    ) b 
WHERE a.subreddit < b.subreddit 
LIMIT 1000000 
) 
WHERE f.subreddit = subreddit1 OR f.subreddit = subreddit2 
GROUP BY subreddit1, subreddit2 
ORDER BY subreddit1, subreddit2 
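To see why joining the comments to the pair list with an OR condition yields the union, here is a small check in SQLite via Python's `sqlite3`. It is a sketch on hypothetical data, not BigQuery itself: the `comments` table and its rows are invented, and the top-N ranking is replaced by simply taking every subreddit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (author TEXT, subreddit TEXT)")
# Hypothetical rows (illustrative data only).
conn.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [
        ("alice", "Art"), ("bob", "Art"), ("carol", "Art"),
        ("carol", "Science"), ("dave", "Science"), ("alice", "Science"),
        ("erin", "Politics"), ("alice", "Politics"), ("frank", "Politics"),
    ],
)

union_query = """
SELECT p.subreddit1, p.subreddit2,
       COUNT(DISTINCT f.author) AS subreddits_union
FROM comments AS f
JOIN (
    -- every unordered pair of subreddits
    SELECT a.subreddit AS subreddit1, b.subreddit AS subreddit2
    FROM (SELECT DISTINCT subreddit FROM comments) AS a
    CROSS JOIN (SELECT DISTINCT subreddit FROM comments) AS b
    WHERE a.subreddit < b.subreddit
) AS p
  -- a comment row is attached to each pair that contains its subreddit
  ON f.subreddit = p.subreddit1 OR f.subreddit = p.subreddit2
GROUP BY p.subreddit1, p.subreddit2
ORDER BY p.subreddit1, p.subreddit2
"""

union_rows = conn.execute(union_query).fetchall()
for row in union_rows:
    print(row)
```

Each comment row is repeated once for every pair that includes its subreddit, so the intermediate result grows with the number of pairs. That fan-out is worth keeping in mind when moving from 10 subreddits to 1000.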

Thanks for your answer. This one works well for returning the subreddit union; however, how would you implement the intersection?