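I want to query the Google BigQuery public Reddit dataset. My goal is to compute the similarity of subreddits using the Jaccard index, which is defined as: J(A, B) = |A ∩ B| / |A ∪ B|, where A and B are the sets of users commenting in the two subreddits.
My plan is to select the top N = 1000 subreddits by number of comments in August 2016, then compute their Cartesian product to get all combinations of subreddits in the form subreddit1, subreddit2.
Then I would use those rows of combinations to query the union and the intersection of users between subreddit1 and subreddit2.
The query I have so far is: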
SELECT
subreddit1,
subreddit2,
(SELECT
COUNT(DISTINCT author)
FROM `fh-bigquery.reddit_comments.2016_08`
WHERE subreddit = subreddit1
OR subreddit = subreddit2
LIMIT 1
) as subreddits_union,
(
SELECT
COUNT(DISTINCT author)
FROM `fh-bigquery.reddit_comments.2016_08`
WHERE subreddit = subreddit1
AND author IN (
SELECT author
FROM `fh-bigquery.reddit_comments.2016_08`
WHERE subreddit= subreddit2
GROUP BY author
)
) as subreddits_intersection
FROM
(SELECT a.subreddit as subreddit1, b.subreddit as subreddit2
FROM (
SELECT subreddit, count(*) as n_comments
FROM `fh-bigquery.reddit_comments.2016_08`
GROUP BY subreddit
ORDER BY n_comments DESC
LIMIT 1000
) a
CROSS JOIN (
SELECT subreddit, count(*) as n_comments
FROM `fh-bigquery.reddit_comments.2016_08`
GROUP BY subreddit
ORDER BY n_comments DESC
LIMIT 1000
) b
WHERE a.subreddit < b.subreddit
)
Ideally, this would give results like:
subreddit1 | subreddit2 | subreddits_union | subreddits_intersection
-----------------------------------------------------------------
Art | Politics | 50000 | 21000
Art | Science | 92320 | 15000
... | ... | ... | ...
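However, this query gives me the following BigQuery error: Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.
I understand that. However, I don't think this query can be transformed into an efficient join. Given that BQ has no apply method, is there any way to set up this query without resorting to individual queries? Maybe with PARTITION BY?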
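For reference, one way such a query might be de-correlated is sketched below. It is only a sketch under stated assumptions, not a confirmed solution: it reduces the comments to distinct (subreddit, author) pairs, self-joins those pairs on author to count the intersection, and derives the union by inclusion-exclusion, |A ∪ B| = |A| + |B| - |A ∩ B|. The CTE names (top_subreddits, subreddit_authors, author_counts, intersections) and the jaccard_index column are illustrative and do not appear in the original query.

```sql
#standardSQL
WITH top_subreddits AS (
  -- Top 1000 subreddits by comment count in August 2016.
  SELECT subreddit, COUNT(*) AS n_comments
  FROM `fh-bigquery.reddit_comments.2016_08`
  GROUP BY subreddit
  ORDER BY n_comments DESC
  LIMIT 1000
),
subreddit_authors AS (
  -- One row per (subreddit, author) pair, restricted to the top subreddits.
  SELECT DISTINCT c.subreddit, c.author
  FROM `fh-bigquery.reddit_comments.2016_08` AS c
  JOIN top_subreddits AS t ON c.subreddit = t.subreddit
),
author_counts AS (
  -- |A|: number of distinct authors per subreddit.
  SELECT subreddit, COUNT(*) AS n_authors
  FROM subreddit_authors
  GROUP BY subreddit
),
intersections AS (
  -- |A ∩ B|: authors who appear in both subreddits of a pair.
  SELECT
    a.subreddit AS subreddit1,
    b.subreddit AS subreddit2,
    COUNT(*) AS subreddits_intersection
  FROM subreddit_authors AS a
  JOIN subreddit_authors AS b
    ON a.author = b.author AND a.subreddit < b.subreddit
  GROUP BY a.subreddit, b.subreddit
)
SELECT
  i.subreddit1,
  i.subreddit2,
  -- |A ∪ B| = |A| + |B| - |A ∩ B|
  c1.n_authors + c2.n_authors - i.subreddits_intersection AS subreddits_union,
  i.subreddits_intersection,
  i.subreddits_intersection
    / (c1.n_authors + c2.n_authors - i.subreddits_intersection) AS jaccard_index
FROM intersections AS i
JOIN author_counts AS c1 ON c1.subreddit = i.subreddit1
JOIN author_counts AS c2 ON c2.subreddit = i.subreddit2
```

Note that pairs with no shared authors do not appear in the output at all, since the self-join only produces rows for authors present in both subreddits; their Jaccard index would be 0. Authors who comment in very many subreddits (bots, for instance) can make the self-join large, so filtering them out beforehand may be worth considering.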
Oh man, thanks a lot! Your queries are all I needed, and they run super fast!