2012-09-18 39 views
2

有沒有人知道是否可以在不使用VBA的情況下在Excel中實現聚類算法,如k-means,dbscan?excel中的集羣

如果有可能,你可以給我一點幫助,如何做到這一點? (一個小例子會幫我)

感謝在adavance

克勞德

+1

通過谷歌,我發現這一點:[以 在Microsoft Excel 2007中不使用宏進行聚類]聲稱(http://www.ijcaonline.org/volume11 /number7/pxc3872144.pdf)我並不是很熟悉聚類分析,所以我無法確認它。 –

+0

Excel中需要完成一個深層的內在原因嗎?爲什麼不轉儲一張表並使用更快,更合適,更靈活的內容? – micans

回答

1

爲什麼你會? Excel並不意味着這一點。

聚類算法經常從使用索引結構中獲益很大,它以智能的方式組織內存中的數據。例如R *樹,kd樹等。

這產生巨大的差異。沒有索引的DBSCAN的複雜度爲O(n^2),而索引結構僅爲O(n log n)的複雜度。

你可能可以做到這一點在VBA(這不是真的Excel了,但Visual Basic),我想。但它更有意義使用現有的代碼爲R * - 樹等

1

this檢查

的K-means算法

的K-means算法是下述步驟,直到迭代穩定性得以實現,即個體記錄的羣集分配不再改變。

確定質心的座標。 (最初質心是隨機的,唯一的點,之後將該組的成員的平均座標分配給質心)。 確定每條記錄對每個質心的歐氏距離。 用最接近質心的組記錄。 代碼

首先,我創建了一個私有類型來表示我們的記錄和質心,並創建了兩個類級別的數組來保存它們以及一個類級別的變量以保存正在執行分析的表。

Private Type Records 
    Dimension() As Double 
    Distance() As Double 
    Cluster As Integer 
End Type 

Dim Table As Range 
Dim Record() As Records 
Dim Centroid() As Records 
User Interface 

以下的方法,Run()可以用作起始點和鉤掛到按鈕等

Sub Run() 
'Run k-Means 
    If Not kMeansSelection Then 
     Call MsgBox("Error: " & Err.Description, vbExclamation, "kMeans Error") 
    End If 
End Sub 

接着,將創建一個方法,該方法提示用戶選擇要分析的表和輸入數據應分組到的所需數量的簇。該函數不需要任何參數並返回一個布爾值,指示是否遇到任何錯誤。

Function kMeansSelection() As Boolean 

'Get user table selection 
    On Error Resume Next 
    Set Table = Application.InputBox(Prompt:= _ 
            "Please select the range to analyse.", _ 
            title:="Specify Range", Type:=8) 

    If Table Is Nothing Then Exit Function  'Cancelled 

    'Check table dimensions 
    If Table.Rows.Count < 4 Or Table.columns.Count < 2 Then 
     Err.Raise Number:=vbObjectError + 1000, Source:="k-Means Cluster Analysis", Description:="Table has insufficent rows or columns." 
    End If 

    'Get number of clusters 
    Dim numClusters As Integer 
    numClusters = Application.InputBox("Specify Number of Clusters", "k Means Cluster Analysis", Type:=1) 

    If Not numClusters > 0 Or numClusters = False Then 
     Exit Function  'Cancelled 
    End If 
    If Err.Number = 0 Then 
     If kMeans(Table, numClusters) Then 
      outputClusters 
     End If 
    End If 

kMeansSelection_Error: 
    kMeansSelection = (Err.Number = 0) 
End Function 

如果一個表已被選擇並適當地定義的簇數,所述K均值(表,numClusters)方法被調用以集羣爲參數的表和數量。

如果kMeans (Table, numClusters)方法執行時沒有錯誤,則會調用最後一個方法outputClusters(),它將在活動工作簿中創建一個新工作表並輸出分析結果。

分配記錄到集羣

這是記錄的實際分析發生和集羣分配製成。 首先,該方法是用函數kMeans(Table As Range, Clusters As Integer) As Boolean聲明的。該函數接受兩個參數,表格被分析爲一個Excel範圍對象和集羣,一個整數表示要創建的集羣數量。

Function kMeans(Table As Range, Clusters As Integer) As Boolean 
'Table - Range of data to group. Records (Rows) are grouped according to attributes/dimensions(columns) 
'Clusters - Number of clusters to reduce records into. 

    On Error Resume Next 

    'Script Performance Variables 
    Dim PassCounter As Integer 

    'Initialize Data Arrays 
    ReDim Record(2 To Table.Rows.Count) 
    Dim r As Integer  'record 
    Dim d As Integer  'dimension index 
    Dim d2 As Integer  'dimension index 
    Dim c As Integer  'centroid index 
    Dim c2 As Integer  'centroid index 
    Dim di As Integer  'distance 

    Dim x As Double  'Variable Distance Placeholder 
    Dim y As Double  'Variable Distance Placeholder 

On error Resume Next用於傳遞誤差達到調用方法,以及數數組索引變量的聲明。 x和y被聲明爲以後在數學運算中使用。

第一步是將Record()數組的大小設置爲表中的行數。 (2到Table.Rows.Count)被使用,因爲它假定(和需要)該表的第一行包含列標題。

然後,對於Record()陣列中的每個記錄中,記錄類型的Dimension()陣列的尺寸設置成列(再次假設第一列包含行標題)的號碼和Distance()陣列的尺寸以簇的數目。然後內部循環將該行中列的值分配給Dimension()陣列。

對於R = LBOUND(記錄)UBound函數(記錄) '初始化尺寸值陣列 使用ReDim記錄(R).Dimension(2要Table.columns.Count) ' 初始化距離陣列 使用ReDim記錄(R) (r).Dimension) Record(r).Dimension(d)= Table.Rows(r).Cells(d).Dellnsion(r) d)。值 Next d Next r

以幾乎相同的方式,初始質心必須初始化。我已將前幾個記錄的座標分配爲初始質心座標,以檢查每個新質心具有唯一座標。如果不是,腳本直接移動到下一個記錄,直到找到一組唯一的質心座標。

歐幾里德距離此處用於計算質心唯一性的方法與稍後用於計算單個記錄與質心之間距離的方法幾乎完全相同。這裏的質心從0

'Initialize Initial Centroid Arrays 
    ReDim Centroid(1 To Clusters) 
    Dim uniqueCentroid As Boolean 

    For c = LBound(Centroid) To UBound(Centroid) 
     'Initialize Centroid Dimension Depth 
     ReDim Centroid(c).Dimension(2 To Table.columns.Count) 

     'Initialize record index to next record 
     r = LBound(Record) + c - 2 

     Do  ' Loop to ensure new centroid is unique 
      r = r + 1  'Increment record index throughout loop to find unique record to use as a centroid 

      'Assign record dimensions to centroid 
      For d = LBound(Centroid(c).Dimension) To UBound(Centroid(c).Dimension) 
       Centroid(c).Dimension(d) = Record(r).Dimension(d) 
      Next d 

      uniqueCentroid = True 

      For c2 = LBound(Centroid) To c - 1 

       'Loop Through Record Dimensions and check if all are the same 
       x = 0 
       y = 0 
       For d2 = LBound(Centroid(c).Dimension) To _ 
        UBound(Centroid(c).Dimension) 
        x = x + Centroid(c).Dimension(d2)^2 
        y = y + Centroid(c2).Dimension(d2)^2 
       Next d2 

       uniqueCentroid = Not Sqr(x) = Sqr(y) 
       If Not uniqueCentroid Then Exit For 
      Next c2 

     Loop Until uniqueCentroid 

    Next c 
The next step is to calculate each records distance from each centroid and assign the record to the nearest cluster. 

Dim lowestDistance As Double檢查唯一性通過測量其尺寸距離 - 的變量lowestDistance持有記錄和質心之間測量迄今針對隨後的測量評價的最短距離。 Dim lastCluster As Integer - 在創建任何新分配之前,lastCluster變量保存記錄分配的簇,並用於評估是否已達到穩定性。 Dim ClustersStable As Boolean - 羣集分配和重心重新計算相重複,直到ClustersStable = true.

昏暗lowestDistance爲雙 昏暗lastCluster作爲整數 昏暗ClustersStable由於布爾

做「雖然集羣並不穩定

PassCounter = PassCounter + 1 
ClustersStable = True  'Until Proved otherwise 

'Loop Through Records 

對於R = LBOUND(記錄)UBound函數(記錄)

lastCluster = Record(r).Cluster 
    lowestDistance = 0  'Reset lowest distance 

    'Loop through record distances to centroids 
    For c = LBound(Centroid) To UBound(Centroid) 

     '====================================================== 
     '   Calculate Euclidean Distance 
     '====================================================== 
     ' d(p,q) = Sqr((q1 - p1)^2 + (q2 - p2)^2 + (q3 - p3)^2) 
     '------------------------------------------------------ 
     ' X = (q1 - p1)^2 + (q2 - p2)^2 + (q3 - p3)^2 
     ' d(p,q) = X 

     x = 0 
     y = 0 
     'Loop Through Record Dimensions 
     For d = LBound(Record(r).Dimension) To _ 
      UBound(Record(r).Dimension) 
      y = Record(r).Dimension(d) - Centroid(c).Dimension(d) 
      y = y^2 
      x = x + y 
     Next d 

     x = Sqr(x)  'Get square root 

     'If distance to centroid is lowest (or first pass) assign record to centroid cluster. 
     If c = LBound(Centroid) Or x < lowestDistance Then 
      lowestDistance = x 
      'Assign distance to centroid to record 
      Record(r).Distance(c) = lowestDistance 
      'Assign record to centroid 
      Record(r).Cluster = c 
     End If 
    Next c 

    'Only change if true 
    If ClustersStable Then ClustersStable = Record(r).Cluster = lastCluster 

Next r 

一旦將每條記錄分配給一個簇,簇的質心將重新定位到簇的平均座標。質心移動後,重新評估每個記錄最接近的質心,並且迭代該過程直到達到穩定(即,聚類分配不再改變)。

'Move Centroids to calculated cluster average 
     For c = LBound(Centroid) To UBound(Centroid)  'For every cluster 

      'Loop through cluster dimensions 
      For d = LBound(Centroid(c).Dimension) To _ 
       UBound(Centroid(c).Dimension) 

       Centroid(c).Cluster = 0  'Reset nunber of records in cluster 
       Centroid(c).Dimension(d) = 0  'Reset centroid dimensions 

       'Loop Through Records 
       For r = LBound(Record) To UBound(Record) 

        'If Record is in Cluster then 
        If Record(r).Cluster = c Then 
         'Use to calculate avg dimension for records in cluster 

         'Add to number of records in cluster 
         Centroid(c).Cluster = Centroid(c).Cluster + 1 
         'Add record dimension to cluster dimension for later division 
         Centroid(c).Dimension(d) = Centroid(c).Dimension(d) + _ 
                Record(r).Dimension(d) 

        End If 

       Next r 

       'Assign Average Dimension Distance 
       Centroid(c).Dimension(d) = Centroid(c).Dimension(d)/_ 
              Centroid(c).Cluster 
      Next d 
     Next c 

    Loop Until ClustersStable 

    kMeans = (Err.Number = 0) 
End Function 

顯示結果

outputClusters()方法輸出在兩個表中的結果。第一個表包含每個記錄名稱和分配的簇號,第二個表包含質心座標。

Function outputClusters() As Boolean 

    Dim c As Integer  'Centroid Index 
    Dim r As Integer  'Row Index 
    Dim d As Integer  'Dimension Index 

    Dim oSheet As Worksheet 
    On Error Resume Next 

    Set oSheet = addWorksheet("Cluster Analysis", ActiveWorkbook) 

    'Loop Through Records 
    Dim rowNumber As Integer 
    rowNumber = 1 

    'Output Headings 
    With oSheet.Rows(rowNumber) 
     With .Cells(1) 
      .Value = "Row Title" 
      .Font.Bold = True 
      .HorizontalAlignment = xlCenter 
     End With 
     With .Cells(2) 
      .Value = "Centroid" 
      .Font.Bold = True 
      .HorizontalAlignment = xlCenter 
     End With 
    End With 

    'Print by Row 
    rowNumber = rowNumber + 1  'Blank Row 
    For r = LBound(Record) To UBound(Record) 
     oSheet.Rows(rowNumber).Cells(1).Value = Table.Rows(r).Cells(1).Value 
     oSheet.Rows(rowNumber).Cells(2).Value = Record(r).Cluster 
     rowNumber = rowNumber + 1 
    Next r 

    'Print Centroids - Headings 
    rowNumber = rowNumber + 1 
    For d = LBound(Centroid(LBound(Centroid)).Dimension) To UBound(Centroid(LBound(Centroid)).Dimension) 
     With oSheet.Rows(rowNumber).Cells(d) 
      .Value = Table.Rows(1).Cells(d).Value 
      .Font.Bold = True 
      .HorizontalAlignment = xlCenter 
     End With 
    Next d 

    'Print Centroids 
    rowNumber = rowNumber + 1 
    For c = LBound(Centroid) To UBound(Centroid) 
     With oSheet.Rows(rowNumber).Cells(1) 
      .Value = "Centroid " & c 
      .Font.Bold = True 
     End With 
     'Loop through cluster dimensions 
     For d = LBound(Centroid(c).Dimension) To UBound(Centroid(c).Dimension) 
      oSheet.Rows(rowNumber).Cells(d).Value = Centroid(c).Dimension(d) 
     Next d 
     rowNumber = rowNumber + 1 
    Next c 

    oSheet.columns.AutoFit  '//AutoFit columns to contents 

outputClusters_Error: 
    outputClusters = (Err.Number = 0) 
End Function 

這是不可能的,這種類型的輸出將是多大用處,但它正好說明在記錄叢集分配或簇的記錄可以在自己的解決方案進行訪問的方式。

outputClusters()函數使用另一個自定義方法:addWorksheet(),它將工作表添加到指定名稱的指定/活動工作簿。如果已經存在具有相同名稱的工作表,則outputClusters()函數會將附加到工作表名稱的數字添加/增加。該WorksheetExists()功能也被包含在以下:

Function addWorksheet(Name As String, Optional Workbook As Workbook) As Worksheet 
    On Error Resume Next 
    '// If a Workbook wasn't specified, use the active workbook 
    If Workbook Is Nothing Then Set Workbook = ActiveWorkbook 

    Dim Num As Integer 
    '// If a worksheet(s) exist with the same name, add/increment a number after the name 
    While WorksheetExists(Name, Workbook) 
     Num = Num + 1 
     If InStr(Name, " (") > 0 Then Name = Left(Name, InStr(Name, " (")) 
     Name = Name & " (" & Num & ")" 
    Wend 

    '//Add a sheet to the workbook 
    Set addWorksheet = Workbook.Worksheets.Add 

    '//Name the sheet 
    addWorksheet.Name = Name 
End Function 

Public Function WorksheetExists(WorkSheetName As String, Workbook As Workbook) As Boolean 
    On Error Resume Next 
    WorksheetExists = (Workbook.Sheets(WorkSheetName).Name <> "") 
    On Error GoTo 0 
End Function