2016-03-10 141 views
3

我與Excel單元格內,用逗號分割成其組成部分的一些英國的地址數據的工作。Excel中 - 字符串刪除重複

我有一些VBA我已經從已經除去了一些精確複製的條目的網絡中獲得,但我留下大量已重複段一些順序和一些非順序數據。

附件是圖像強調什麼,我試圖實現,但迄今爲止這是不是我包括我使用的代碼向您展示中,我一直在尋找的方向。任何人都可以進一步思考如何實現這一目標?

Function stringOfUniques(inputString As String, delimiter As String) 
Dim xVal As Variant 
Dim dict As Object 
Set dict = CreateObject("Scripting.Dictionary") 

For Each xVal In Split(inputString, delimiter) 
dict(xVal) = xVal 
Next xVal 

stringOfUniques = Join(dict.Keys(), ",") 
End Function 

這確實設法擺脫了一些他們,但有一個龐大的人口,我正在努力如此自動化這將是不可思議的。

Ideal Outcome

+0

一個RegExp與反向引用將是另一個可能的選擇 – brettdj

回答

4

可能不是最優雅的答案,但確實的伎倆。 這裏我使用Split命令在每個逗號處分割字符串。 從這個返回的結果是

bat ball banana 

代碼:

Option Explicit 
Private Sub test() 
Dim Mystring As String 
Dim StrResult As String 

Mystring = "bat,ball,bat,ball,banana" 
StrResult = shed_duplicates(Mystring) 
End Sub 
Private Function shed_duplicates(ByRef Mystring As String) As String 
Dim MySplitz() As String 
Dim J As Integer 
Dim K As Integer 
Dim BooMatch As Boolean 
Dim StrTemp(10) As String ' assumes no more than 10 possible splits! 
Dim StrResult As String 


MySplitz = Split(Mystring, ",") 
    For J = 0 To UBound(MySplitz) 
    BooMatch = False 
    For K = 0 To UBound(StrTemp) 
     If MySplitz(J) = StrTemp(K) Then 
      BooMatch = True 
      Exit For 
     End If 
    Next K 
    If Not BooMatch Then 
     StrTemp(J) = MySplitz(J) 
    End If 
Next 
For J = 0 To UBound(StrTemp) 
    If Len(StrTemp(J)) > 0 Then ' ignore blank entries 
     StrResult = StrResult + StrTemp(J) + " " 
    End If 
Next J 
Debug.Print StrResult 
End Function 
3

你可能真的使用正則表達式替換:

^(\d*\s*([^,]*),.*)\2(,|$) 

替換模式是

$1$3 

請參閱regex demo。所述圖案解釋

  • ^ - 串的開始(或線的如果.MultiLine = True
  • (\d*\s*([^,]*),.*) - 第1組(稍後參考與從替換模式$1反向引用)匹配:
    • \d* - 0+位數字與
    • \s* - 0+空格字符
    • ([^,]*) - 第2組(以後,我們可以使用在\2圖案反向引用來引用與該子模式捕捉到的值)相匹配比逗號
    • ,.*其他0+字符 - 逗號後具有比其他換行符0+字符
  • \2 - 由組2
  • (,|$)捕獲的文本 - 第3組(稍後從替換模式參照與$3 - 還原逗號)匹配逗號或字符串的末尾(或行,如果.MultiLine = True)。

注意:你不需要.MultiLine = True如果你只是檢查單個細胞含有一個地址。

下面是展示如何在VBA中使用的樣本VBA子:

Sub test() 
    Dim regEx As Object 
    Set regEx = CreateObject("VBScript.RegExp") 
    With regEx 
     .pattern = "^(\d*\s*([^,]*),.*)\2(,|$)" 
     .Global = True 
     .MultiLine = True ' Remove if individual addresses are matched 
    End With 
    s = "66 LAUSANNE ROAD,LAUSANNE ROAD,HORNSEY" & vbCrLf & _ 
     "9 CARNELL LANE,CARNELL LANE,FERNWOOD" & vbCrLf & _ 
     "35 FLAT ANDERSON HEIGHTS,1001 LONDON ROAD,FLAT ANDERSON HEIGHTS" & vbCrLf & _ 
     "27 RUSSELL BANK ROAD,RUSSEL BANK,SUTTON COLDFIELD" 
    MsgBox regEx.Replace(s, "$1$3") 
End Sub 

enter image description here

+0

偉大的工作!請注意,在正則表達式替換後刪除兩倍逗號的小調整是值得的。 – brettdj

+1

我從未見過正則表達式函數,示例解決方案! –

1

第一種解決方案是使用一個字典來獲得獨特的段列表。 它隨後將被作爲分裂段之前跳過第一地址數作爲簡單:

Function RemoveDuplicates1(text As String) As String 
    Static dict As Object 
    If dict Is Nothing Then 
    Set dict = CreateObject("Scripting.Dictionary") 
    dict.CompareMode = 1 ' set the case sensitivity to All 
    Else 
    dict.RemoveAll 
    End If 

    ' Get the position just after the address number 
    Dim c&, istart&, segment 
    For istart = 1 To Len(text) 
    c = Asc(Mid$(text, istart, 1)) 
    If (c < 48 Or c > 57) And c <> 32 Then Exit For ' if not [0-9 ] 
    Next 

    ' Split the segments and add each one of them to the dictionary. No need to keep 
    ' a reference to each segment since the keys are returned by order of insertion. 
    For Each segment In Split(Mid$(text, istart), ",") 
    If Len(segment) Then dict(segment) = Empty 
    Next 

    ' Return the address number and the segments by joining the keys 
    RemoveDuplicates1 = Mid$(text, 1, istart - 1) & Join(dict.keys(), ",") 
End Function 

第二種解決方案將是,以提取所有的段,然後搜索如果它們中的每一個存在於先前的位置是:

Function RemoveDuplicates2(text As String) As String 
    Dim c&, segments$, segment$, length&, ifirst&, istart&, iend& 

    ' Get the position just after the address number 
    For ifirst = 1 To Len(text) 
    c = Asc(Mid$(text, ifirst, 1)) 
    If (c < 48 Or c > 57) And c <> 32 Then Exit For ' if not [0-9 ] 
    Next 

    ' Get the segments without the address number and add a leading/trailing comma 
    segments = "," & Mid$(text, ifirst) & "," 
    istart = 1 

    ' iterate each segment 
    Do While istart < Len(segments) 

    ' Get the next segment position 
    iend = InStr(istart + 1, segments, ",") - 1 And &HFFFFFF 
    If iend - istart Then 

     ' Get the segment 
     segment = Mid$(segments, istart, iend - istart + 2) 

     ' Rewrite the segment if not present at a previous position 
     If InStr(1, segments, segment, vbTextCompare) = istart Then 
     Mid$(segments, length + 1) = segment 
     length = length + Len(segment) - 1 
     End If 
    End If 

    istart = iend + 1 
    Loop 

    ' Return the address number and the segments 
    RemoveDuplicates2 = Mid$(text, 1, ifirst - 1) & Mid$(segments, 2, length - 1) 

End Function 

和第三解決方案將是使用正則表達式來除去所有的重複鏈段:

Function RemoveDuplicates3(ByVal text As String) As String 

    Static re As Object 
    If re Is Nothing Then 
    Set re = CreateObject("VBScript.RegExp") 
    re.Global = True 
    re.IgnoreCase = True 
    ' Match any duplicated segment separated by a comma. 
    ' The first segment is compared without the first digits. 
    re.Pattern = "((^\d* *|,)([^,]+)(?=,).*),\3?(?=,|$)" 
    End If 

    ' Remove each matching segment 
    Do While re.test(text) 
    text = re.Replace(text, "$1") 
    Loop 

    RemoveDuplicates3 = text 
End Function 

這些都是對於10000次迭代(越低越好),執行時間:

input text : "123 abc,,1 abc,abc 2,ABC,abc,a,c" 
output text : "123 abc,1 abc,abc 2,a,c" 

RemoveDuplicates1 (dictionary) : 718 ms 
RemoveDuplicates2 (text search) : 219 ms 
RemoveDuplicates3 (regex)  : 1469 ms 
+0

再一次感謝您的衆多優雅的解決方案,每個似乎都做我需要的東西。不勝感激! –