2012-05-29
2

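Regex pattern to extract authorization numbers

I used gskinner's RegExr tool to help come up with a pattern that finds authorization numbers in a field containing a lot of other junk. An authorization number is a string that contains letters (sometimes), digits (always), and hyphens (sometimes); that is, it always contains a digit somewhere, but not always hyphens or letters. The authorization number can also appear anywhere in the field I am searching.

Examples of valid authorization numbers include: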

5555834384734 ' All digits 
12110-AANM  ' Alpha plus digits, plus hyphens 
R-455545-AB-9 ' Alpha plus digits, plus multiple hyphens 
R-45-54A-AB-9 ' Alpha plus digits, plus multiple hyphens 
W892160  ' Alpha plus digits without hyphens 

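Below is some sample data with junk attached. The junk is sometimes joined to the real authorization number with a hyphen or with no space at all, so it looks like part of the number. It comes in predictable forms/words: REF, CHEST, IP, AMB, OBV, and HOLD are never part of the authorization number.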

5557653700 IP 
R025257413-001 
REF 120407175 
SNK601M71016 
U0504124 AMB 
W892160 
019870270000000 
00Q926K2 
A025229563 
01615217 AMB 
12042-0148 
SNK601M71016 
12096NHP174 
12100-ACDE 
12110-AANM 
12114AD5QIP 
REF-34555 
3681869/OBV ONL 

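Here is the pattern I am using. I am still learning regular expressions, so it can no doubt be improved:

"\b[a-zA-Z]*[\d]+[-]*[\d]*[A-Za-z0-9]*[\b]*" 

It works for the cases above, but not for the following: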

REFA5-208-4990IP 'Extract the string 'A5-208-4990' without REF or IP 
OBV1213110379  'Extract the string '1213110379' without the OBV 
5520849900AMB  'Extract the string '5520849900' without AMB 
5520849900CHEST 'Extract the string '5520849900' without CHEST 
5520849900-IP  'Extract the string '5520849900' without -IP 
1205310691-OBV 'Extract the string without the -OBV 
R-025257413-001 'Numbers of this form should also be allowed. 
NO PCT 93660  'If string contains the word NO anywhere, it is not a match 
HOLDA5-208-4990 'If string contains the word HOLD anywhere, it is not a match 
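To see why these cases fail, it helps to run the original pattern over a few of them. The following is a minimal sketch in Python rather than VBA, assuming that Python's `re` module treats these particular constructs the same way as the VBScript engine; the helper name `original_matches` is made up for illustration.

```python
import re

# The pattern from the question, copied as-is (including the trailing [\b]*).
ORIGINAL = re.compile(r"\b[a-zA-Z]*[\d]+[-]*[\d]*[A-Za-z0-9]*[\b]*")


def original_matches(field):
    """Return every non-overlapping match of the original pattern in field."""
    return ORIGINAL.findall(field)


for field in ("W892160", "5520849900AMB", "REFA5-208-4990IP", "R-025257413-001"):
    print(f"{field!r:22} -> {original_matches(field)}")
```

Under that assumption, `W892160` comes back whole, `5520849900AMB` keeps its `AMB` suffix because the trailing `[A-Za-z0-9]*` swallows it, `REFA5-208-4990IP` is split into `REFA5-208` and `4990IP` because the pattern allows only one hyphenated group, and `R-025257413-001` loses its leading `R-` because a digit must come before the first hyphen.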

Can anyone help?

For testing purposes, here is a Sub that creates a table with the sample input data:

Sub CreateTestAuth() 

Dim dbs As Database 
Set dbs = CurrentDb 

With dbs 
    .Execute "CREATE TABLE tbl_test_auth " _ 
     & "(AUTHSTR CHAR);" 
    .Execute " INSERT INTO tbl_test_auth " _ 
     & "(AUTHSTR) VALUES " _ 
     & "('5557653700 IP');" 
    .Execute " INSERT INTO tbl_test_auth " _ 
     & "(AUTHSTR) VALUES " _ 
     & "(' R025257413-001');" 
    .Execute " INSERT INTO tbl_test_auth " _ 
     & "(AUTHSTR) VALUES " _ 
     & "('REF 120407175');" 
    .Execute " INSERT INTO tbl_test_auth " _ 
     & "(AUTHSTR) VALUES " _ 
     & "('SNK601M71016');" 
    .Execute " INSERT INTO tbl_test_auth " _ 
     & "(AUTHSTR) VALUES " _ 
     & "('U0504124 AMB');" 
    .Execute " INSERT INTO tbl_test_auth " _ 
     & "(AUTHSTR) VALUES " _ 
     & "('3681869/OBV ONL');" 
    .Execute " INSERT INTO tbl_test_auth " _ 
     & "(AUTHSTR) VALUES " _ 
     & "('REFA5-208-4990IP');" 
    .Execute " INSERT INTO tbl_test_auth " _ 
     & "(AUTHSTR) VALUES " _ 
     & "('5520849900AMB');" 
    .Execute " INSERT INTO tbl_test_auth " _ 
     & "(AUTHSTR) VALUES " _ 
     & "('5520849900CHEST');" 
    .Execute " INSERT INTO tbl_test_auth " _ 
     & "(AUTHSTR) VALUES " _ 
     & "('5520849900-IP');" 
    .Execute " INSERT INTO tbl_test_auth " _ 
     & "(AUTHSTR) VALUES " _ 
     & "('1205310691-OBV');" 
    .Execute " INSERT INTO tbl_test_auth " _ 
     & "(AUTHSTR) VALUES " _ 
     & "('HOLDA5-208-4990');" 
    .Close 
End With 
End Sub 

Answers

0

Your sample input file (pass this file's path to Function<GetMatches> as inputFilePath):

5557653700 IP 
R025257413-001 
REF 120407175 
SNK601M71016 
U0504124 AMB 
W892160 
019870270000000 
00Q926K2 
A025229563 
01615217 AMB 
12042-0148 
SNK601M71016 
12096NHP174 
12100-ACDE 
12110-AANM 
12114AD5QIP 
REF-34555 
3681869/OBV ONL 

Here are the junk patterns, saved in a file (pass this file's path to Function<GetMatches> as replaceDBPath):

^REF 
IP$ 
^OBV 
AMB$ 
CHEST$ 
-OBV$ 
^.*(NO|HOLD).*$ 
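The effect of this junk list can be sketched outside VBA. The snippet below is an illustration in Python, not the answer's code: it applies the same seven patterns, in the same order and case-insensitively, to one field at a time. The name `strip_junk` and the per-line handling are assumptions made for the example.

```python
import re

# The junk patterns listed above, in file order.
JUNK_PATTERNS = [
    r"^REF",
    r"IP$",
    r"^OBV",
    r"AMB$",
    r"CHEST$",
    r"-OBV$",
    r"^.*(NO|HOLD).*$",  # blanks the whole line if it contains NO or HOLD
]


def strip_junk(line):
    """Remove each junk pattern from a single field, then trim whitespace."""
    line = line.strip()
    for pattern in JUNK_PATTERNS:
        line = re.sub(pattern, "", line, flags=re.IGNORECASE)
    return line.strip()


for raw in ("REFA5-208-4990IP", "5520849900-IP", "1205310691-OBV", "NO PCT 93660"):
    print(f"{raw!r:20} -> {strip_junk(raw)!r}")
```

Two things are worth noticing. `5520849900-IP` becomes `5520849900-`, so the trailing hyphen still has to be dropped by the extraction pattern that runs afterwards. And the last pattern blanks any line that contains the letters NO or HOLD anywhere, including inside a longer word, which may or may not be what the data needs.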

And here is the .bas module:

Option Explicit 
'This example uses the following references: 
'Microsoft VBScript Regular Expressions 5.5 and Microsoft Scripting Runtime 

Private fso As New Scripting.FileSystemObject 
Private re As New VBScript_RegExp_55.RegExp 

Private Function GetJunkList(fpath$) As String() 
0  On Error GoTo errHandler 
1  If fso.FileExists(fpath) Then 
2   Dim junkList() As String, mts As MatchCollection, mt As Match, pos&, tmp$ 
3   tmp = fso.OpenTextFile(fpath).ReadAll() 
4   With re 
5    .Global = True 
6    .MultiLine = True 
7    .Pattern = "[^\r\n]+" 
8    Set mts = .Execute(tmp) 
9    ReDim junkList(mts.Count - 1) 
10   For Each mt In mts 
11    junkList(pos) = mt.Value 
12    pos = pos + 1 
13   Next mt 
14  End With 
15  GetJunkList = junkList 
16 Else 
17  MsgBox "File not found at:" & vbCr & fpath 
18 End If 
19 Exit Function 
errHandler: 
    Dim Msg$ 
    With Err 
     'Erl is 0 when the failing line carries no line number 
     Msg = "Error '" & .Number & " " & _ 
     .Description & "' occurred in " & _ 
     "Function<GetJunkList>" & IIf(Erl <> 0, " at line " & CStr(Erl) & ".", ".") 
    End With 
    MsgBox Msg, vbCritical 
End Function 

Public Function GetMatches(replaceDBPath$, inputFilePath$) As String() 
0  On Error GoTo errHandler 
1  Dim junks() As String, junkPat$, tmp$, results() As String, pos&, mts As MatchCollection, mt As Match 
2  junks = GetJunkList(replaceDBPath) 
3  tmp = fso.OpenTextFile(inputFilePath).ReadAll 
4 
5  With re 
6  .Global = True 
7  .MultiLine = True 
8  .IgnoreCase = True 
9  For pos = LBound(junks) To UBound(junks) 
10   .Pattern = junkPat 
11   junkPat = junks(pos) 
12   'replace junk with [] 
13   tmp = .Replace(tmp, "") 
14  Next pos 
15 
16  'trim lines [if all input data in one line] 
17  .Pattern = "^[ \t]*|[ \t]*$" 
18  tmp = .Replace(tmp, "") 
19 
20  'create array using provided pattern 
21  pos = 0 
22  .Pattern = "\b[a-z]*[\d]+\-*\d*[a-z0-9]*\b" 
23  Set mts = .Execute(tmp) 
24  ReDim results(mts.Count - 1) 
25  For Each mt In mts 
26   results(pos) = mt.Value 
27   pos = pos + 1 
28  Next mt 
29 End With 
30 
31 GetMatches = results 
32 Exit Function 
errHandler: 
    Dim Msg$ 
    With Err 
     'Erl is 0 when the failing line carries no line number 
     Msg = "Error '" & .Number & " " & _ 
     .Description & "' occurred in " & _ 
     "Function<GetMatches>" & IIf(Erl <> 0, " at line " & CStr(Erl) & ".", ".") 
    End With 
    MsgBox Msg, vbCritical 
End Function 

And a sample test:

Public Sub tester() 
    Dim samples() As String, s 
    samples = GetMatches("C:\Documents and Settings\Cylian\Desktop\junks.lst", "C:\Documents and Settings\Cylian\Desktop\sample.txt") 
    For Each s In samples 
     MsgBox s 
    Next 
End Sub 

It can be called from the Immediate window:

tester 

Hope this helps.

0

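The \b at the start is a problem. There are also some spaces and dashes that need to be taken care of. Try this: "[a-zA-Z|\s|-]*[\d]+[-]*[\d]*[A-Za-z0-9]*[\b]*". Run it against the authorization number only.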

+1

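The '\b' at the start seems fine, since the first character in the examples is always a letter or a digit. The final '[\b]' is incorrect (it matches a backspace character, not a word boundary), but the '*' makes it optional, so it has no effect at all. Also, your '[a-zA-Z|\s|-]' should just be '[a-zA-Z\s-]'; "or" is implicit in a character class, so the '|' matches a literal '|' character.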
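Both points in this comment are easy to check. A small sketch in Python, whose `re` module documents the same two behaviors (whether every other engine agrees is an assumption; the variable names are illustrative):

```python
import re

# Inside a character class, | is just another literal character.
pipe_hits = re.findall(r"[a|b]", "a|b")

# Inside a character class, \b means backspace (\x08), not a word boundary.
backspace_hit = re.fullmatch(r"[\b]", "\x08") is not None
boundary_hit = re.search(r"[\b]", "two words") is not None

print(pipe_hits)      # ['a', '|', 'b']
print(backspace_hit)  # True
print(boundary_hit)   # False
```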

0

Because of the additional filtering, I would use a two-step approach.

using System;
using System.Linq;
using System.Text.RegularExpressions;

var splitter = new Regex(@"[\t\n\r]+", RegexOptions.Multiline); 
const string INPUT = @"REFA5-208-4990IP 
     OBV1213110379 
     5520849900AMB 
     5520849900CHEST 
     5520849900-IP 
     1205310691-OBV 
     R-025257413-001 
     NO PCT 93660 
     HOLDA5-208-4990"; 
string[] lines = splitter.Split(INPUT); 

var blacklist = new[] { "NO", "HOLD" }; 
var ignores = new[] { "REF", "IP", "CHEST", "AMB", "OBV" }; 

var filtered = from line in lines 
     where blacklist.All(black => line.IndexOf(black) < 0) 
     select ignores.Aggregate(line, (acc, remove) => acc.Replace(remove, "")); 

var authorization = new Regex(@"\b([a-z0-9]+(?:-[a-z0-9]+)*)\b", RegexOptions.IgnoreCase); 
foreach (string s in filtered) 
{ 
    Console.Write("'{0}' ==> ", s); 
    var match = authorization.Match(s); 
    if (match.Success) 
    { 
    Console.Write(match.Value); 
    } 
    Console.WriteLine(); 
} 

This prints:

'A5-208-4990' ==> A5-208-4990 
' 1213110379' ==> 1213110379 
' 5520849900' ==> 5520849900 
' 5520849900' ==> 5520849900 
' 5520849900-' ==> 5520849900 
' 1205310691-' ==> 1205310691 
' R-025257413-001' ==> R-025257413-001 
+0

Oops, sorry, I overlooked the *vba* tag. Still, the logic stays the same. :-) – primfaktor

+0

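Thanks, primfaktor! Could you break down the RegEx pattern you used for me? After testing with some data, I realized that an authorization number must contain at least one digit. Your pattern works great, but it also matches strings made up entirely of letters, _e.g._ it matches the word "yes". How can the pattern be updated so that it must contain at least one digit? – regulus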

+0

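Sorry, I didn't see that this was a requirement. The regex basically says: "Find at least one alphanumeric character, possibly followed by more alphanumeric runs attached with hyphens." Let me think of a fix. – primfaktor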

0

Sometimes it is easier to loosen things up than to stick strictly to one approach. :)

Try this:

1 - Add this function:

Public Function RemoveJunk(ByVal inputValue As String, ParamArray junkWords() As Variant) As String 
    Dim junkWord 
    For Each junkWord In junkWords 
     inputValue = Replace(inputValue, junkWord, "", , , vbBinaryCompare) 
    Next 
    RemoveJunk = inputValue 
End Function 

2 - Now your task is easy. See the example below for how to use it:

Sub Sample() 
    Dim theText As String 
    theText = " REFA5-208-4990IP blah blah " 
    theText = RemoveJunk(theText, "-REF", "REF", "-IP", "IP", "-OBV", "OBV") '<-- complete this in a similar way 

    Debug.Print theText 

    '' -- now apply the regexp here -- 


End Sub 

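Completing the RemoveJunk function call is a bit tricky. Put the longer junk words before the shorter ones. For example, "-OBV" should come before "OBV".

Try it and see whether it solves your problem.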
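The ordering rule can be enforced in code instead of by hand. A minimal Python sketch of the same idea, with hypothetical function names, sorts the junk words by length before removing them:

```python
def remove_junk_in_given_order(text, junk_words):
    """Remove junk words exactly in the order supplied (case-sensitive)."""
    for word in junk_words:
        text = text.replace(word, "")
    return text


def remove_junk(text, junk_words):
    """Remove junk words longest first, so '-OBV' is tried before 'OBV'."""
    ordered = sorted(junk_words, key=len, reverse=True)
    return remove_junk_in_given_order(text, ordered)


badly_ordered = ["OBV", "-OBV"]
print(remove_junk_in_given_order("1205310691-OBV", badly_ordered))  # 1205310691-
print(remove_junk("1205310691-OBV", badly_ordered))                 # 1205310691
```

Like the VBA version with vbBinaryCompare, `str.replace` is case-sensitive and removes the word wherever it occurs, not only at the start or end of the field.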

1

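OK, at first I thought the additional requirement would make the regex a lot longer, but with a positive lookahead it stays almost the same size. Regex only this time:
\b(?=.*\d)([a-z0-9]+(?:-[a-z0-9]+)*)\b

Or broken down with comments (ignore the whitespace):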

\b                    # Word start
(?=.*\d)              # A number has to follow somewhere after this point
(                     # Start capture group
    [a-z0-9]+         # At least one alphanum
    (?:-[a-z0-9]+)*   # Possibly more attached with hyphen
)                     # End capture group
\b                    # Word end

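Note, however, that variable-width lookahead is not supported by all regex flavors. I don't know about VBA's.

Second note: the (?=) thingy is also satisfied if the digit appears after the end of the word. So in
DONT-RECOGNIZE-ME but-1-5ay-yes
the first part, DONT-RECOGNIZE-ME, would be captured as well, because a digit follows later in the string.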
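The second note can be reproduced directly, and one way around it is to drop the lookahead and check for a digit in ordinary code after matching. This is a sketch in Python of that alternative, not the answer's own fix; the function names are invented for the example.

```python
import re

# Pattern from the answer above (case-insensitive).
LOOKAHEAD = re.compile(r"\b(?=.*\d)([a-z0-9]+(?:-[a-z0-9]+)*)\b", re.IGNORECASE)
# Same pattern without the lookahead; the digit check moves into plain code.
TOKEN = re.compile(r"\b[a-z0-9]+(?:-[a-z0-9]+)*\b", re.IGNORECASE)


def with_lookahead(text):
    """All matches of the lookahead pattern."""
    return [m.group(0) for m in LOOKAHEAD.finditer(text)]


def with_post_filter(text):
    """All tokens that themselves contain at least one digit."""
    return [m.group(0) for m in TOKEN.finditer(text)
            if any(ch.isdigit() for ch in m.group(0))]


sample = "DONT-RECOGNIZE-ME but-1-5ay-yes"
lookahead_hits = with_lookahead(sample)
filtered_hits = with_post_filter(sample)
print(lookahead_hits)  # ['DONT-RECOGNIZE-ME', 'but-1-5ay-yes']
print(filtered_hits)   # ['but-1-5ay-yes']
```

The post-filter keeps a token only when the digit is inside the token itself, at the cost of a second step outside the regex.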

+0

OT: I'm not very good with English and with words that do or don't take a hyphen. Is that "Note, however, ..." sentence correct? – primfaktor