2016-03-03 81 views
1

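Multi-criteria statistics (average, standard deviation, z-score) for large datasets in Excel/VBA

I am calculating statistics on a large dataset in Excel and am running into problems because of the dataset's size.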

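VBA looks like the way to go, because copying AVERAGEIF and STDEV array formulas across a dataset of this size leads to very long calculation times. I would appreciate any solutions or code that could be used here.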

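Goals:

  • Calculate statistics (average, standard deviation, z-score) conditional on 2 identifiers (e.g., the average of all heights on 01/01/10)
  • Be able to handle large datasets (100K+ data points)

Sample data: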

Date | User ID | Indicator | Data Point 
01/01/10| 1  | Height | 150 
01/01/10| 1  | Weight | 123 
01/01/10| 2  | Height | 146 
01/01/10| 2  | Weight | 123 
01/02/10| 1  | Height | 156 
01/02/10| 1  | Weight | 160 
01/02/10| 2  | Height | 103 
01/02/10| 2  | Weight | 109 

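Edit:

The expected output would ideally be a z-score for each data point, placed in a new column. Example: for the first row, the z-score would be normalized against all heights on 01/01/10: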

(150 - avg)/stdev 
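As a worked instance of the formula above, take the two Height rows dated 01/01/10 in the sample data (150 and 146). This assumes the sample standard deviation, which is what Excel's STDEV computes; the post does not say which variant is intended.

```latex
\[
\bar{x} = \frac{150 + 146}{2} = 148, \qquad
s = \sqrt{\frac{(150 - 148)^2 + (146 - 148)^2}{2 - 1}} = \sqrt{8} \approx 2.83, \qquad
z = \frac{150 - 148}{s} \approx 0.71
\]
```

Under that assumption, any group with exactly two distinct values yields z-scores of about +0.71 and -0.71, so the sample data is too small to show much variety in the output.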
+0

How would you like the output? Please show the expected output in the original post. –

+1

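If you are not restricted to VBA, I would suggest using Python's Pandas library. By combining a pandas DataFrame with xlwings, you can easily import and export large amounts of data to and from Excel spreadsheets. –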

+0

Unfortunately I am restricted to VBA in this case. Maybe it is something to use in the future. – mpny1

Answers

0

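I am not sure about the z-scores, since I get the same value (+/-) for every data point. But I believe you will be able to modify the code to get what you want. The data should be in a worksheet named "Data", which has a command button named Go that runs the code. Careful! The code clears everything from column E onward.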

    Dim lLastRowDB As Long 
    Dim dU1 As Object, cU1 As Variant, iU1 As Long, lrU As Long 
    Dim dU2 As Object, cU2 As Variant, iU2 As Long 
    Dim MyArray() As Variant 
    Dim lAV As Double 
    Dim lSD As Double 
    'Long, not Integer: row counters overflow an Integer past row 32,767 
    Dim i As Long 
    Dim j As Long 
    Dim k As Long 

    Private Sub Go_Click() 
    Worksheets("Data").Columns("E:EZ").Delete Shift:=xlToLeft 'Clear previous results 
    lLastRowDB = Worksheets("Data").Cells(2, 1).End(xlDown).Row 'Assuming your data starts in A2 

    'Indexes from Column1 (Dates) 
    Set dU1 = CreateObject("Scripting.Dictionary") 
    lrU = Cells(Rows.Count, 1).End(xlUp).Row 
    cU1 = Range("A2:A" & lrU) 
    For iU1 = 1 To UBound(cU1, 1) 
     dU1(cU1(iU1, 1)) = 1 
    Next iU1 

    'Indexes from Column3 (Indicators) 
    Set dU2 = CreateObject("Scripting.Dictionary") 
    cU2 = Range("C2:C" & lrU) 
    j = 0 
    For iU2 = 1 To UBound(cU2, 1) 
     dU2(cU2(iU2, 1)) = 1 
    Next iU2 

    'If want to see values in dictionaries, uncomment following six lines 

    'For i = 0 To dU1.Count - 1 
    ' MsgBox "dU1 has " & dU1.Count & " elements and key#" & i & " is " & dU1.Keys()(i) 
    'Next 
    'For i = 0 To dU2.Count - 1 
    ' MsgBox "dU2 has " & dU2.Count & " elements and key#" & i & " is " & dU2.Keys()(i) 
    'Next 

    'The following code scans the complete set of data for each index 
    'This accounts for unsorted data, but is resource-intensive 
    'If your data is sorted for sure, just loop over the desired rows 

    For i = 0 To dU1.Count - 1 'for each Date 
     For j = 0 To dU2.Count - 1 'for each Indicator 
      ReDim MyArray(1 To 1) As Variant 'reset the array 
      For k = 2 To lLastRowDB 'Scan all rows 
       If (Worksheets("Data").Cells(k, 1).Value = dU1.keys()(i)) Then 
        If (Worksheets("Data").Cells(k, 3).Value = dU2.keys()(j)) Then 
         MyArray(UBound(MyArray)) = Worksheets("Data").Cells(k, 4).Value 'add found value to array 
         ReDim Preserve MyArray(1 To UBound(MyArray) + 1) As Variant 'now array is 1 element longer 
        End If 
       End If 
      Next 
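      'Now MyArray contains the desired data plus one unused trailing slot. 
      'Get average and SD 
      lAV = 0 
      lSD = 0 
      If UBound(MyArray) > 1 Then 
       ReDim Preserve MyArray(1 To UBound(MyArray) - 1) As Variant 'drop the unused trailing slot 
       lAV = Application.WorksheetFunction.Average(MyArray) 
       'StDev needs at least two values; with a single value lSD stays 0 
       If UBound(MyArray) > 1 Then lSD = Application.WorksheetFunction.StDev(MyArray) 
      End If 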
      'Titles 
      Worksheets("Data").Cells(1, 5) = "Average" 
      Worksheets("Data").Cells(1, 6) = "SD" 
      Worksheets("Data").Cells(1, 7) = "z-scores" 

      For k = 2 To lLastRowDB 
       If (Worksheets("Data").Cells(k, 1).Value = dU1.keys()(i)) Then 
        If (Worksheets("Data").Cells(k, 3).Value = dU2.keys()(j)) Then 
         Worksheets("Data").Cells(k, 5) = lAV 
         Worksheets("Data").Cells(k, 6) = lSD 
         If lSD = 0 Then 
          Worksheets("Data").Cells(k, 7) = "SD is zero. Unable to calculate z-scores" 
         Else 
          Worksheets("Data").Cells(k, 7) = (Worksheets("Data").Cells(k, 4).Value - lAV)/lSD 'z-scores 
         End If 
        End If 
       End If 
      Next 
     Next 
    Next 

    End Sub 
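
The loop above rescans every row once per date/indicator pair and reads cells one at a time, which may get slow at the 100K+ rows mentioned in the question. A possible alternative, as a rough sketch only: read the block into an array once, accumulate per-group statistics in a Scripting.Dictionary, then write all results back in a single assignment. The names GroupedZScores, stats, acc and the "|" key separator are made up for illustration. The sketch assumes the same layout as above (headers in row 1, data in columns A:D of sheet "Data"), that column D is numeric, and that overwriting columns E:G is acceptable. It uses Welford's running update for the variance instead of calling WorksheetFunction.StDev.

```vba
' Sketch: grouped average, sample SD and z-score in two passes over an array.
' Assumed layout: A = Date, B = User ID, C = Indicator, D = Data Point.
Public Sub GroupedZScores()
    Dim ws As Worksheet
    Dim lastRow As Long, r As Long, n As Long
    Dim data As Variant, result() As Variant
    Dim stats As Object      ' key -> Array(count, running mean, sum of squared deviations)
    Dim key As String
    Dim acc As Variant
    Dim x As Double, delta As Double, sd As Double

    Set ws = ThisWorkbook.Worksheets("Data")
    lastRow = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row
    If lastRow < 2 Then Exit Sub                 ' no data rows

    data = ws.Range("A2:D" & lastRow).Value      ' read the whole block at once
    n = UBound(data, 1)
    Set stats = CreateObject("Scripting.Dictionary")

    ' Pass 1: update the statistics of each (date, indicator) group.
    For r = 1 To n
        key = CStr(data(r, 1)) & "|" & CStr(data(r, 3))
        x = CDbl(data(r, 4))
        If stats.Exists(key) Then
            acc = stats(key)
        Else
            acc = Array(0#, 0#, 0#)
        End If
        acc(0) = acc(0) + 1                      ' count
        delta = x - acc(1)
        acc(1) = acc(1) + delta / acc(0)         ' running mean
        acc(2) = acc(2) + delta * (x - acc(1))   ' sum of squared deviations
        stats(key) = acc
    Next r

    ' Pass 2: look up each row's group and compute its z-score.
    ReDim result(1 To n, 1 To 3)
    For r = 1 To n
        key = CStr(data(r, 1)) & "|" & CStr(data(r, 3))
        acc = stats(key)
        sd = 0
        If acc(0) > 1 Then sd = Sqr(acc(2) / (acc(0) - 1))   ' sample SD, as STDEV does
        result(r, 1) = acc(1)
        If acc(0) > 1 Then result(r, 2) = sd
        If sd > 0 Then result(r, 3) = (CDbl(data(r, 4)) - acc(1)) / sd
    Next r

    ' Write headers and all results in one go; cells left Empty stay blank.
    ws.Range("E1:G1").Value = Array("Average", "SD", "z-score")
    ws.Range("E2").Resize(n, 3).Value = result
End Sub
```

This could be called from Go_Click in place of the nested loops. Rows whose group has a single value or a zero SD get a blank z-score rather than a message; adjust that if you prefer the text used above.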