2016-09-13

Answer

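The following query performs linear regression using calculations that are numerically stable and easy to adapt to any input table. It produces the slope and intercept of the best fit to the model Y = SLOPE * X + INTERCEPT, together with the Pearson correlation coefficient, which it obtains from the built-in function CORR.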

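As an example, we use the public natality dataset to compute birth weight as a linear function of pregnancy duration, broken down by state. The query could be written more compactly, but we use several layers of subqueries to highlight how the pieces fit together. To apply it to another dataset, you only need to replace the innermost query.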

SELECT Bucket,
       SLOPE,
       (SUM_OF_Y - SLOPE * SUM_OF_X) / N AS INTERCEPT,
       CORRELATION
FROM (
  SELECT Bucket,
         N,
         SUM_OF_X,
         SUM_OF_Y,
         CORRELATION * STDDEV_OF_Y / STDDEV_OF_X AS SLOPE,
         CORRELATION
  FROM (
    SELECT Bucket,
           COUNT(*) AS N,
           SUM(X) AS SUM_OF_X,
           SUM(Y) AS SUM_OF_Y,
           STDDEV_POP(X) AS STDDEV_OF_X,
           STDDEV_POP(Y) AS STDDEV_OF_Y,
           CORR(X,Y) AS CORRELATION
    FROM (
      SELECT state AS Bucket,
             gestation_weeks AS X,
             weight_pounds AS Y
      FROM [publicdata.samples.natality])
    WHERE Bucket IS NOT NULL AND
          X IS NOT NULL AND
          Y IS NOT NULL
    GROUP BY Bucket));
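
For reference, the two derived columns in the query correspond to the standard least-squares identities. In the notation below, r stands for CORR(X,Y), and the sigmas stand for STDDEV_POP(X) and STDDEV_POP(Y); the bar notation for means is introduced here only for illustration and does not appear in the query.

```latex
\[
\mathrm{SLOPE} = r \,\frac{\sigma_Y}{\sigma_X},
\qquad
\mathrm{INTERCEPT} = \bar{Y} - \mathrm{SLOPE}\cdot\bar{X}
                   = \frac{\sum Y - \mathrm{SLOPE}\cdot\sum X}{N}
\]
```

The last form is the one the outer SELECT computes from SUM_OF_Y, SUM_OF_X and N.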

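Using the STDDEV_POP and CORR functions improves the numerical stability of this query compared with summing the products of X and Y and then taking differences and dividing. If you run both approaches on a well-behaved dataset, however, you can verify that they produce the same results to high precision.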
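
The agreement between the two approaches can be checked outside BigQuery as well. The following is a minimal Python sketch, not the query itself: the function names `fit_via_corr` and `fit_via_raw_sums`, the synthetic data, and the coefficients 0.3 and -4.0 are all assumptions made up for illustration. The first function mirrors the structure of the query (correlation times the ratio of population standard deviations), while the second uses the textbook sum-of-products formula.

```python
import math
import random


def fit_via_corr(xs, ys):
    """Slope and intercept from the correlation and population std devs,
    mirroring the structure of the SQL query."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Population variances and covariance, computed from centered values.
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    corr = cov_xy / math.sqrt(var_x * var_y)
    slope = corr * math.sqrt(var_y) / math.sqrt(var_x)
    intercept = (sum(ys) - slope * sum(xs)) / n
    return slope, intercept, corr


def fit_via_raw_sums(xs, ys):
    """Slope and intercept from raw sums of products, then differences
    and a division (the less stable alternative)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x * sum_x)
    intercept = (sum_y - slope * sum_x) / n
    return slope, intercept


# Synthetic, well-behaved data (hypothetical values, not the natality table).
rng = random.Random(0)
xs = [rng.uniform(20.0, 44.0) for _ in range(200)]
ys = [0.3 * x - 4.0 + rng.gauss(0.0, 1.0) for x in xs]

slope_a, intercept_a, corr = fit_via_corr(xs, ys)
slope_b, intercept_b = fit_via_raw_sums(xs, ys)

print("corr-based:", slope_a, intercept_a, corr)
print("raw sums:  ", slope_b, intercept_b)
```

On data like this the two results should differ only by floating-point rounding. The raw-sum form subtracts two large, nearly equal quantities, so the gap can grow when X values are large relative to their spread.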