2014-03-03 72 views
1
倒排索引有效方式

我創建一個倒排索引將文字的字典了行號的相關列表出現在字(開始行號,並出現在單詞列表該行內的給定單元格)。創造VBA

我設法得到了一些代碼,這方面的工作,但我發現處理(在字典中的值)添加到陣列是一個有點麻煩,我不知道是有一個更有效或更優雅的方式來處理這個(事情。

我願意使用數組,集合或能夠很容易地搜索到行號的列表存儲在所述字典的值的任何其它數據類型。我已經貼了我的代碼來證明下面的核心問題砍下版本,真正的問題是差不多的BuildInvertedIndex程序,但包括儘量使其更容易重現場景中的其餘部分:

Sub Test() 
' minimum included here to demonstrate use of buildInvertedIndex procedure 

    Dim vRange As Range 
    Dim vDict As Dictionary 

    Set vRange = ActiveSheet.Range("F2:F20585") 
    Set vDict = New Dictionary 

    BuildInvertedIndex vDict, vRange 

    ' test values returned in dictionary (word: [line 1, ..., line n]) 
    Dim k As Variant, vCounter As Long 
    vCounter = 0 
    For Each k In vDict.Keys 
     Debug.Print k & ": " & ArrayToString(vDict.Item(k)) 
     vCounter = vCounter + 1 
     If vCounter >= 10 Then 
      Exit For 
     End If 
    Next 


End Sub 


Sub BuildInvertedIndex(pDict As Dictionary, pRange As Range) 

    Dim cell As Range 
    Dim words As Variant, word As Variant, val As Variant 
    Dim tmpArr() As Long 
    Dim newLen As Long, i As Long 

    ' loop through cells (one col wide so same as looping through lines) 
    For Each cell In pRange.Cells 

     ' loop through words in line 
     words = Split(cell.Value) 
     For Each word In words 

      If Not pDict.exists(word) Then 
       ' start line array with first row number 
       pDict.Add word, Array(cell.Row()) 
      Else 
       i = 0 
       If Not InArray(cell.Row(), pDict.Item(word)) Then 
        newLen = UBound(pDict.Item(word)) + 1 
        ReDim tmpArr(newLen) 
        For Each val In tmpArr 
         If i < newLen Then 
          tmpArr(i) = pDict.Item(word)(i) 
         Else 
          tmpArr(i) = cell.Row() 
         End If 
         i = i + 1 
        Next val 
        pDict.Item(word) = tmpArr 
       End If 
      End If 
     Next word 
    Next cell 

End Sub 


Function ArrayToString(vArray As Variant, _ 
         Optional vDelim As String = ",") As String 
' only included to support test (be able to see what is in the arrays) 

    Dim vDelimString As String 
    Dim i As Long 

    For i = LBound(vArray) To UBound(vArray) 
     vDelimString = vDelimString & CStr(vArray(i)) & _ 
         IIf(vCounter < UBound(vArray), vDelim, "") 
    Next 

    ArrayToString = vDelimString 
End Function 

要運行此操作,需要活動工作表(語句)F列中的值,如果您尚未擁有它,則還需要在VBA環境中添加對Microsoft腳本運行時的引用,以使字典數據類型可用(工具 - >參考 - > Microsoft腳本運行時)。

如你將從此變得有點雜亂,我必須插入新的行號到現有的陣列(即存儲爲字典內的值)的代碼看到。因爲我不知道如何擴展這個數組(不清除現有值),所以我使用了變量tmpArr來創建一個合適大小的數組,然後從字典中的現有數組中逐個拷貝這些值然後將當前行號添加到最後。臨時數組然後用於替換該鍵(當前單詞)的現有值。

任何意見,將不勝感激。

+0

不能直接使用存儲在字典中的數組 - 通常的方法是將其從字典中提取出來,修改它,然後將其重新存儲在同一個插槽中。 EG:http://stackoverflow.com/questions/16447088/adding-to-an-array-in-vba-with-strings-as-the-index/16451081#16451081 –

回答

1

我願意使用數組,集合或任何其他數據類型

正如我看到的,使用集合,而不是陣列將是更simplier:

Sub BuildInvertedIndex(pDict As Dictionary, pRange As Range) 
    Dim cell As Range 
    Dim words, word 
    Dim i As Long  
    ' loop through cells (one col wide so same as looping through lines) 
    For Each cell In pRange.Cells  
     ' loop through words in line 
     words = Split(cell.Value) 
     For Each word In words  
      If Not pDict.Exists(word) Then 
       ' initialize collection 
       pDict.Add word, New Collection 
      End If 
      'try to add to collection. If row is already in collecton, nothing happend. Storing key makes you sure there're only unique rows 
      On Error Resume Next 
      pDict.Item(word).Add Item:=cell.Row, Key:=CStr(cell.Row) 
      On Error GoTo 0     
     Next word 
    Next cell 
End Sub 

下一步,是將稍微修改ArrayToStringColToString

Function ColToString(vCol As Collection, _ 
        Optional vDelim As String = ",") As String 
' only included to support test (be able to see what is in the arrays) 

    Dim vDelimString As String 
    Dim i As Long 

    For i = 1 To vCol.Count 
     vDelimString = vDelimString & CStr(vCol.Item(i)) & _ 
         IIf(i < vCol.Count, vDelim, "") 
    Next 

    ColToString = vDelimString 
End Function 

和測試子程序(改變只有一行 - Debug.Print k & ": " & ColToString(vDict.Item(k))和目標範圍"F2:F5"):

Sub Test() 
' minimum included here to demonstrate use of buildInvertedIndex procedure 

    Dim vRange As Range 
    Dim vDict As Dictionary 

    Set vRange = ActiveSheet.Range("F2:F5") 
    Set vDict = New Dictionary 

    BuildInvertedIndex vDict, vRange 

    ' test values returned in dictionary (word: [line 1, ..., line n]) 
    Dim k As Variant, vCounter As Long 
    vCounter = 0 
    For Each k In vDict.Keys 
     Debug.Print k & ": " & ColToString(vDict.Item(k)) 
     vCounter = vCounter + 1 
     If vCounter >= 10 Then 
      Exit For 
     End If 
    Next 

    'clean up memory 
    Set vDict = Nothing 
End Sub 

結果:

enter image description here


UPDATE:

爲了提高你的合作速度德你可以存儲在陣列範圍(另一個方法的工作只與單柱範圍,但你可以很容易地修改它):

測試子:

Sub TestWirhArray() 
' minimum included here to demonstrate use of buildInvertedIndex procedure 

    Dim vRange As Range 
    Dim vDict As Dictionary 
    Dim myArr As Variant 

    Set vDict = New Dictionary 
    Set vRange = ActiveSheet.Range("F2:F20585") 
    myArr = vRange.Value 
    BuildInvertedIndexWithArr vDict, myArr, vRange.Row 

    ' test values returned in dictionary (word: [line 1, ..., line n]) 
    Dim k As Variant, vCounter As Long 
    vCounter = 0 
    For Each k In vDict.Keys 
     Debug.Print k & ": " & ColToString(vDict.Item(k)) 
     vCounter = vCounter + 1 
     If vCounter >= 10 Then 
      Exit For 
     End If 
    Next 

    'clean up memory 
    Set vDict = Nothing 
End Sub 

新版本的BuildInvertedIndexWithArr

Sub BuildInvertedIndexWithArr(pDict As Dictionary, pArr, firstRow As Long) 
    Dim cell, words, word 
    Dim i As Long, j As Long 

    j = firstRow 
    ' loop through cells (one col wide so same as looping through lines) 
    For Each cell In pArr 

     ' loop through words in line 
     words = Split(cell) 
     For Each word In words 

      If Not pDict.exists(word) Then 
       ' initialize collection 
       pDict.Add word, New Collection 
      End If 

      On Error Resume Next 
      pDict.Item(word).Add Item:=j, Key:=CStr(j) 
      On Error GoTo 0 

     Next word 
     j = j + 1 
    Next cell 
End Sub 
+1

感謝您的詳細回覆,只是嘗試一下現在。 – ChrisProsser

+1

太棒了,謝謝。它不僅看起來更乾淨,而且速度提高了10倍以上。在超過20k的行上,計時是:'在11327.0713492417ms內使用數組作爲字典中的值,爲20585行建立索引。'和'在727.502338194183ms內使用數組作爲字典中值的20585行的建立索引。' – ChrisProsser

+1

這很酷:)一個更多的提示如何加快你的代碼:而是使用範圍,你可以將它存儲在數組中(它應該給你〜10-20倍的速度提高)。我已經用這種方法更新了我的內容。你可以嘗試一下,現在就說出時間,只是好奇:)。 –