2017-06-15 21 views
0

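BigQuery - how to create columns for a new table using row values

I have the following genomic table in BigQuery (over 12K rows). A long list of PIK3CA_features (column 2) is associated with the same sample_id (column 1).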

Row  sample_id  PIK3CA_features
1    hu011C57   chr3_3930069__TGT
2    hu011C57   chr3_3929921_TC
3    hu011C57   chr3_3929739_TC
4    hu011C57   chr3_3929813__T
5    hu011C57   chr3_3929897_GA
6    hu011C57   chr3_3929977_TC
7    hu011C57   chr3_3929783_TC

I would like to generate a table like the following:

Row  sample_id  chr3_3930069__TGT  chr3_3929921_TC  chr3_3929739_TC
1    hu011C57   1                  1                0
2    hu011C58   0

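That is, one row per sample ID, with a 1 or 0 in each column depending on whether that PIK3CA_feature is present in the sample.

Any idea how to generate this table easily?

Many thanks for any ideas!

Answers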

0

You can do this by grouping on the sample ID.

-- Standard SQL: one row per sample, one column per feature.
SELECT
    sample_id,
    COUNTIF(PIK3CA_features = 'chr3_3930069__TGT') AS chr3_3930069__TGT,
    COUNTIF(PIK3CA_features = 'chr3_3929921_TC') AS chr3_3929921_TC,
    COUNTIF(PIK3CA_features = 'chr3_3929739_TC') AS chr3_3929739_TC
FROM `your_table`  -- placeholder: substitute your own table name
GROUP BY sample_id;

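Assuming you have no duplicate PIK3CA_features per sample ID, this should give you what you need. If duplicates do exist, COUNTIF returns the number of occurrences rather than a 1/0 flag.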
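If duplicates cannot be ruled out, one possible variant is to take the maximum of a 1/0 flag instead of counting, so the result stays at 1 no matter how often a feature repeats. The sketch below assumes BigQuery standard SQL; the inline `data` rows, including the deliberate duplicate, are made up for illustration and stand in for the real table.

```sql
-- Hypothetical inline sample data; replace with your own table.
WITH data AS (
  SELECT 'hu011C57' AS sample_id, 'chr3_3930069__TGT' AS PIK3CA_features UNION ALL
  SELECT 'hu011C57', 'chr3_3930069__TGT' UNION ALL  -- deliberate duplicate
  SELECT 'hu011C57', 'chr3_3929921_TC' UNION ALL
  SELECT 'hu011C58', 'chr3_3929739_TC'
)
SELECT
  sample_id,
  -- MAX over a 1/0 flag is 1 if the feature appears at least once, else 0.
  MAX(IF(PIK3CA_features = 'chr3_3930069__TGT', 1, 0)) AS chr3_3930069__TGT,
  MAX(IF(PIK3CA_features = 'chr3_3929921_TC', 1, 0)) AS chr3_3929921_TC,
  MAX(IF(PIK3CA_features = 'chr3_3929739_TC', 1, 0)) AS chr3_3929739_TC
FROM data
GROUP BY sample_id
ORDER BY sample_id;
```

As with the COUNTIF version, every feature still has to be listed by hand, so this only scales to a modest number of distinct features.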

1

The only idea that comes to mind is to use the ARRAYS and STRUCTS concepts to get reasonably close to what you need, like this:

-- Sample data standing in for the real table.
WITH data AS (
    SELECT 'hu011C57' sample_id, 'chr3_3930069__TGT' PIK3CA_features UNION ALL
    SELECT 'hu011C57', 'chr3_3929921_TC' UNION ALL
    SELECT 'hu011C57', 'chr3_3929739_TC' UNION ALL
    SELECT 'hu011C57', 'chr3_3929813__T' UNION ALL
    SELECT 'hu011C57', 'chr3_3929897_GA' UNION ALL
    SELECT 'hu011C57', 'chr3_3929977_TC' UNION ALL
    SELECT 'hu011C57', 'chr3_3929783_TC' UNION ALL
    SELECT 'hu011C58', 'chr3_3929783_TC' UNION ALL
    SELECT 'hu011C58', 'chr3_3929921_TC'
),

-- Every distinct feature across all samples.
all_features AS (
    SELECT DISTINCT PIK3CA_features FROM data
),

-- One row per sample with the array of features it contains.
aggregated_samples AS (
    SELECT
        sample_id,
        ARRAY_AGG(DISTINCT PIK3CA_features) features
    FROM data
    GROUP BY sample_id
)

-- For each sample, one struct per known feature with a presence flag.
SELECT
    sample_id,
    ARRAY(
        SELECT AS STRUCT
            PIK3CA_features,
            PIK3CA_features IN (SELECT feature FROM UNNEST(features) feature) AS present
        FROM all_features
        ORDER BY PIK3CA_features
    ) features
FROM aggregated_samples

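This returns one row per sample_id together with a corresponding array of structs, one per feature, each recording whether that feature is present in the sample_id.

Since BigQuery natively supports this type of data structure, you can keep your data in this representation without losing any capacity for advanced analyses (such as analytic functions, subqueries and so on).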
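To illustrate how the nested result can still be analysed, here is a minimal sketch that counts the features present in each sample by unnesting the array of structs in a subquery. It assumes BigQuery standard SQL; the three inline rows and the names `per_sample`, `feature_flags`, `feature` and `n_present` are hypothetical and chosen only for this example.

```sql
-- Hypothetical inline sample data; replace with your own table.
WITH data AS (
  SELECT 'hu011C57' AS sample_id, 'chr3_3930069__TGT' AS PIK3CA_features UNION ALL
  SELECT 'hu011C57', 'chr3_3929921_TC' UNION ALL
  SELECT 'hu011C58', 'chr3_3929921_TC'
),
all_features AS (
  SELECT DISTINCT PIK3CA_features FROM data
),
aggregated_samples AS (
  SELECT sample_id, ARRAY_AGG(DISTINCT PIK3CA_features) AS features
  FROM data
  GROUP BY sample_id
),
-- Same shape as the answer's result: one struct per feature with a flag.
per_sample AS (
  SELECT
    sample_id,
    ARRAY(
      SELECT AS STRUCT
        PIK3CA_features AS feature,
        PIK3CA_features IN UNNEST(features) AS present
      FROM all_features
      ORDER BY PIK3CA_features
    ) AS feature_flags
  FROM aggregated_samples
)
SELECT
  sample_id,
  -- Correlated subquery over the nested array: count the TRUE flags.
  (SELECT COUNTIF(f.present) FROM UNNEST(feature_flags) AS f) AS n_present
FROM per_sample
ORDER BY sample_id;
```

The same UNNEST pattern can be used in a WHERE clause, for example to keep only the samples in which a particular feature is present.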
