2012-12-03 41 views
7

我正在爲Microsoft Office套件構建私人拼寫檢查程序。我正在對拼寫錯誤和他們的潛在修復進行字符串比較,以確定我希望包含哪些更正。VBA中的加權Damerau-Levenshtein

我看着高和低的加權 Damerau - 萊文斯坦公式字符串比較,因爲我想互換,插入,刪除和替換均具有不同的權重,而不是簡單的「1」的權重,所以我可以優先考慮對其他人的一些改正。例如,錯字「agmes」在理論上可以修正爲「遊戲」「年齡」,因爲兩者都只需要一次編輯就可以移動到拼寫正確的單詞,但是我想讓「交換」編輯更低體重,以便「遊戲」將顯示爲首選更正。

我正在使用Excel進行分析,因此我使用的任何代碼都需要在Visual Basic for Applications(VBA)中使用。我能找到的最好的是this example,這看起來不錯,但它是用Java編寫的。我盡我所能轉換,但我遠離專家,可以使用一點幫助!

任何人都可以看看附加的代碼,並幫助我找出什麼是錯的?

謝謝!

編輯:我得到它在我自己的工作。這是VBA中的加權Damerau-Levenshtein公式。它使用Excel的內置數學函數進行一些評估。將錯字與兩種可能的更正進行比較時,首選成本爲成本最低。這是因爲兩次交換的成本必須大於刪除和插入的成本,如果以最低成本(我認爲是理想的)分配交換,這是不可能的。查看凱文的博客,如果你需要更多的信息。

Public Function WeightedDL(source As String, target As String) As Double 

    Dim deleteCost As Double 
    Dim insertCost As Double 
    Dim replaceCost As Double 
    Dim swapCost As Double 

    deleteCost = 1 
    insertCost = 1.1 
    replaceCost = 1.1 
    swapCost = 1.2 

    Dim i As Integer 
    Dim j As Integer 
    Dim k As Integer 

    If Len(source) = 0 Then 
     WeightedDL = Len(target) * insertCost 
     Exit Function 
    End If 

    If Len(target) = 0 Then 
     WeightedDL = Len(source) * deleteCost 
     Exit Function 
    End If 

    Dim table() As Double 
    ReDim table(Len(source), Len(target)) 

    Dim sourceIndexByCharacter() As Variant 
    ReDim sourceIndexByCharacter(0 To 1, 0 To Len(source) - 1) As Variant 

    If Left(source, 1) <> Left(target, 1) Then 
     table(0, 0) = Application.Min(replaceCost, (deleteCost + insertCost)) 
    End If 

    sourceIndexByCharacter(0, 0) = Left(source, 1) 
    sourceIndexByCharacter(1, 0) = 0 

    Dim deleteDistance As Double 
    Dim insertDistance As Double 
    Dim matchDistance As Double 

    For i = 1 To Len(source) - 1 

     deleteDistance = table(i - 1, 0) + deleteCost 
     insertDistance = ((i + 1) * deleteCost) + insertCost 

     If Mid(source, i + 1, 1) = Left(target, 1) Then 
      matchDistance = (i * deleteCost) + 0 
     Else 
      matchDistance = (i * deleteCost) + replaceCost 
     End If 

     table(i, 0) = Application.Min(Application.Min(deleteDistance, insertDistance), matchDistance) 
    Next 

    For j = 1 To Len(target) - 1 

     deleteDistance = table(0, j - 1) + insertCost 
     insertDistance = ((j + 1) * insertCost) + deleteCost 

     If Left(source, 1) = Mid(target, j + 1, 1) Then 
      matchDistance = (j * insertCost) + 0 
     Else 
      matchDistance = (j * insertCost) + replaceCost 
     End If 

     table(0, j) = Application.Min(Application.Min(deleteDistance, insertDistance), matchDistance) 
    Next 

    For i = 1 To Len(source) - 1 

     Dim maxSourceLetterMatchIndex As Integer 

     If Mid(source, i + 1, 1) = Left(target, 1) Then 
      maxSourceLetterMatchIndex = 0 
     Else 
      maxSourceLetterMatchIndex = -1 
     End If 

     For j = 1 To Len(target) - 1 

      Dim candidateSwapIndex As Integer 
      candidateSwapIndex = -1 

      For k = 0 To UBound(sourceIndexByCharacter, 2) 
       If sourceIndexByCharacter(0, k) = Mid(target, j + 1, 1) Then candidateSwapIndex = sourceIndexByCharacter(1, k) 
      Next 

      Dim jSwap As Integer 
      jSwap = maxSourceLetterMatchIndex 

      deleteDistance = table(i - 1, j) + deleteCost 
      insertDistance = table(i, j - 1) + insertCost 
      matchDistance = table(i - 1, j - 1) 

      If Mid(source, i + 1, 1) <> Mid(target, j + 1, 1) Then 
       matchDistance = matchDistance + replaceCost 
      Else 
       maxSourceLetterMatchIndex = j 
      End If 

      Dim swapDistance As Double 

      If candidateSwapIndex <> -1 And jSwap <> -1 Then 

       Dim iSwap As Integer 
       iSwap = candidateSwapIndex 

       Dim preSwapCost 
       If iSwap = 0 And jSwap = 0 Then 
        preSwapCost = 0 
       Else 
        preSwapCost = table(Application.Max(0, iSwap - 1), Application.Max(0, jSwap - 1)) 
       End If 

       swapDistance = preSwapCost + ((i - iSwap - 1) * deleteCost) + ((j - jSwap - 1) * insertCost) + swapCost 

      Else 
       swapDistance = 500 
      End If 

      table(i, j) = Application.Min(Application.Min(Application.Min(deleteDistance, insertDistance), matchDistance), swapDistance) 

     Next 

     sourceIndexByCharacter(0, i) = Mid(source, i + 1, 1) 
     sourceIndexByCharacter(1, i) = i 

    Next 

    WeightedDL = table(Len(source) - 1, Len(target) - 1) 

End Function 
+0

您可以發佈這樣的一個答案,以自己的qusetion,使其離開該未答覆的隊列? – JustinJDavies

回答

1

我可以看到你已經回答了這個自己:我寫了一個改進的Levenshtein編輯距離算法地址的幾年前的匹配:

http://hairyears.livejournal.com/115867.html 

...但是這並沒有執行並且'共同字符串'的方法足以完成任務:

http://excellerando.blogspot.com/2010/03/vlookup-with-fuzzy-matching-to-get.html 

該代碼可能需要重新測試和重新工作。

看你的代碼,如果你想重溫它,這裏的速度提示

 
Dim arrByte() As Byte 
Dim byteChar As Byte 

arrByte = strSource 

for i = LBound(arrByte) To UBound(arrByte) Step 2 
    byteChar = arrByte(i)     ' I'll do some comparison operations using integer arithmetic on the char 
Next i 

字符串處理在VBA是可怕的慢,即使你使用Mid $()而不是Mid(),但是數值操作非常好:字符串實際上是字節數組,而編譯器將以面值接受。

循環中2的「step」是跳過unicode字符串中的高位字節 - 你是可能是在plain-vanilla ASCII文本上運行字符串比較,你會看到(例如)「ABCd」的字節數組是(00,65,00,66,00,77,100,100)。西歐國家的大部分拉丁字母 - 重音符號,變音符號,dipthongs和所有 - 都將在255以下,並且不會冒險進入在該示例中顯示爲零的更高位字節。

你會擺脫它在希伯來文,希臘文,俄文和阿拉伯文嚴格的單語字符串比較,因爲高字節每個字母中恆:希臘「αβγδ」是字節數組(03,12,03 ,12,03,12,03,12)。然而,這是草率的編碼,它會咬你(或字節)你的屁股,當你嘗試跨語言的字符串比較。它永遠不會用東方字母飛行。

0

相信這些線錯誤: -

deleteDistance = table(0, j - 1) + insertCost 
insertDistance = ((j + 1) * insertCost) + deleteCost 

認爲應該: -

deleteDistance = ((j + 1) * insertCost) + deleteCost 
insertDistance = table(0, j - 1) + insertCost 

有沒有通過代碼去上班了正在發生的事情,不過下面是奇數!

If Left(source, 1) <> Left(target, 1) Then 
    table(0, 0) = Application.Min(replaceCost, (deleteCost + insertCost)) 
End If 

當你需要更換,刪除或插入它可能應該是: -

If Left(source, 1) <> Left(target, 1) Then 
    table(0, 0) = Application.Min(replaceCost, Application.Min(deleteCost, insertCost)) 
End If