2017-03-01 38 views
1

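Summing/merging/combining reversed pairs in BigQuery

I have a table like the one below, in which the count for a paired relationship often appears a second time with the pair in the opposite order.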

country1 country2 count 
CHN   KOR   65 
TWN   KOR   32 
KOR   CHN   43 

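Here I have both CHN - KOR and KOR - CHN. Having established that these are not separate counts, just two ways of writing the same relationship, I want to sum the counts of such pairs so that the final result is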

country1 country2 count 
CHN   KOR   108 
TWN   KOR   32 

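I am using BigQuery. Does anyone know a way to consolidate reversed pairs in SQL? Note: these rows are not duplicates, so this is not a question of removing duplicates but of combining reversed pairs.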

Answers

1

Here is one method:

select country1, country2, sum(count) 
from ((select country1, country2, count 
     from t 
     where country1 <= country2 
    ) union all 
     (select country2, country1, count 
     from t 
     where country1 > country2 
    ) 
    ) cc 
group by country1, country2; 
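
For anyone who wants to try this without a real table, here is a minimal sketch that feeds the question's three sample rows into the same query. The inline `t` CTE and the `ORDER BY` are additions for illustration only; everything else follows the query above.

```sql
#standardSQL
-- Sample rows from the question; `t` is only a stand-in for the real table.
WITH t AS (
  SELECT 'CHN' AS country1, 'KOR' AS country2, 65 AS count UNION ALL
  SELECT 'TWN', 'KOR', 32 UNION ALL
  SELECT 'KOR', 'CHN', 43
)
SELECT country1, country2, SUM(count) AS count
FROM (
  -- Rows whose pair is already in alphabetical order are kept as they are.
  SELECT country1, country2, count FROM t WHERE country1 <= country2
  UNION ALL
  -- Rows in reverse order have their two columns swapped.
  SELECT country2, country1, count FROM t WHERE country1 > country2
) cc
GROUP BY country1, country2
ORDER BY country1, country2;
```

On these rows the result should be CHN / KOR / 108 and KOR / TWN / 32. Note that the TWN - KOR row comes back as KOR - TWN, because every pair is normalized to alphabetical order, not only the pairs that happen to occur in both directions.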

This will work in both the legacy and the standard SQL interfaces. In standard SQL, BigQuery also supports greatest() and least() on strings:

select least(country1, country2) as country1, greatest(country1, country2) as country2, sum(count) as count 
from t 
group by 1, 2; 
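
To see what LEAST() and GREATEST() do to each pair before any aggregation, a small sketch along these lines can be run on the question's sample rows. The inline `t` CTE and the output column names (`original1`, `pair_low`, and so on) are made up for illustration.

```sql
#standardSQL
-- Sample rows from the question; `t` is only a stand-in for the real table.
WITH t AS (
  SELECT 'CHN' AS country1, 'KOR' AS country2, 65 AS count UNION ALL
  SELECT 'TWN', 'KOR', 32 UNION ALL
  SELECT 'KOR', 'CHN', 43
)
SELECT
  country1 AS original1,
  country2 AS original2,
  -- The alphabetically smaller code always lands in the first column ...
  LEAST(country1, country2) AS pair_low,
  -- ... and the larger one in the second, whatever the input order was.
  GREATEST(country1, country2) AS pair_high,
  count
FROM t;
```

Both CHN - KOR and KOR - CHN map to the pair (CHN, KOR), which is why grouping on the two normalized columns merges them. One caveat to keep in mind: LEAST() and GREATEST() return NULL when any argument is NULL, so rows with a missing country would need separate handling.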
+0

Just a small correction: BigQuery Standard SQL does not allow expressions in GROUP BY, so your solution should be corrected to "group by 1, 2".

+0

@MoshaPasumansky . . . Thank you.

3

Another option, which shows off the power and coolness of BigQuery Standard SQL

#standardSQL 
WITH pairs AS (
    SELECT 
    (SELECT STRING_AGG(country ORDER BY country) 
     FROM UNNEST(ARRAY[country1, country2]) AS country 
    ) AS countries, 
    SUM(COUNT) AS COUNT 
    FROM yourTable 
    GROUP BY countries 
) 
SELECT 
    REGEXP_EXTRACT(countries, r'(\w+),') AS country1, 
    REGEXP_EXTRACT(countries, r',(\w+)') AS country2, 
    COUNT 
FROM pairs 

This version can be the better choice when more than two fields may appear in the "wrong order"

You can quickly test it with the dummy data below

#standardSQL 
WITH yourTable AS (
SELECT 'CHN' AS country1, 'KOR' AS country2, 65 AS COUNT UNION ALL 
SELECT 'TWN', 'KOR', 32 UNION ALL 
SELECT 'KOR', 'CHN', 43 
) 
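
Since the snippet above only defines the dummy data, here is a sketch of how it might be combined with the two-column query into a single runnable statement. It follows the same pattern as the three-column example in this answer; the trailing `ORDER BY` is an addition for readable output.

```sql
#standardSQL
WITH yourTable AS (
  SELECT 'CHN' AS country1, 'KOR' AS country2, 65 AS count UNION ALL
  SELECT 'TWN', 'KOR', 32 UNION ALL
  SELECT 'KOR', 'CHN', 43
),
pairs AS (
  SELECT
    -- Sort the two codes within each row and join them into one key,
    -- e.g. both 'CHN','KOR' and 'KOR','CHN' become 'CHN,KOR'.
    (SELECT STRING_AGG(country ORDER BY country)
     FROM UNNEST([country1, country2]) AS country
    ) AS countries,
    SUM(count) AS count
  FROM yourTable
  GROUP BY countries
)
SELECT
  -- Split the comma-separated key back into two columns.
  REGEXP_EXTRACT(countries, r'(\w+),') AS country1,
  REGEXP_EXTRACT(countries, r',(\w+)') AS country2,
  count
FROM pairs
ORDER BY country1;
```

With the dummy rows this should give CHN / KOR / 108 and KOR / TWN / 32. The regular expressions assume the codes contain only word characters and no commas.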

And below is the case where more than two fields are shuffled

#standardSQL 
WITH yourTable AS (
SELECT 'CHN' AS country1, 'KOR' AS country2, 'US' as country3, 65 AS COUNT UNION ALL 
SELECT 'TWN', 'KOR', 'GB', 32 UNION ALL 
SELECT 'KOR', 'US', 'CHN', 43 
), 
pairs AS (
    SELECT 
    (SELECT STRING_AGG(country ORDER BY country) 
     FROM UNNEST(ARRAY[country1, country2, country3]) AS country 
    ) AS countries, 
    SUM(COUNT) AS COUNT 
    FROM yourTable 
    GROUP BY countries 
) 
SELECT 
    REGEXP_EXTRACT(countries, r'(\w+),\w+,\w+') AS country1, 
    REGEXP_EXTRACT(countries, r'\w+,(\w+),\w+') AS country2, 
    REGEXP_EXTRACT(countries, r'\w+,\w+,(\w+)') AS country3, 
    COUNT 
FROM pairs 

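Of course, this quick example can be optimized further, but the main point here is the shuffling logic, which does not require multiple comparisons and the like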

Addendum

Thank you, @GordonLinoff, for insisting on the option below! I think you are right: it is more elegant to use ARRAY_AGG here

#standardSQL 
WITH yourTable AS (
SELECT 'CHN' AS country1, 'KOR' AS country2, 'US' AS country3, 65 AS count UNION ALL 
SELECT 'TWN', 'KOR', 'GB', 32 UNION ALL 
SELECT 'KOR', 'US', 'CHN', 43 
), 
pairs AS (
    SELECT 
    (SELECT ARRAY_AGG(country ORDER BY country) 
     FROM UNNEST(ARRAY[country1, country2, country3]) AS country 
    ) AS countries, 
    count 
    FROM yourTable 
) 
SELECT 
    countries[OFFSET(0)] AS country1, 
    countries[OFFSET(1)] AS country2, 
    countries[OFFSET(2)] AS country3, 
    SUM(count) AS count 
FROM pairs 
GROUP BY 1, 2, 3 
+0

Why are you using `string_agg()` instead of `array_agg()` and then just pulling out the elements?

+0

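@GordonLinoff - you can go that way too. I just wanted to give a direction different from the classic/obvious one given in the other answer. If you like it, you are welcome to use it :o)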

+0

If you had used arrays, I would upvote. Using a string instead of an array to represent a list is not general, and it is simply inelegant.