2013-12-10 25 views
0

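Regular expression to analyze COBOL code

I am currently writing a tool to analyze COBOL code. For this I need a regular expression that splits a line into words, and I am terrible at regular expressions.

I found the following, which works in most cases but not all.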

string[] words = Regex.Split(line, @"[^\p{L}]*\p{Z}[^\p{L}]*"); 

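The problem is that it takes a field like ARG-1 and returns only ARG. It also does not split something like MY-TABLE(WS-INDEX) into MY-TABLE and WS-INDEX. Any help pointing me in the right direction would be greatly appreciated.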

Update

Thanks for all the insight. What I ended up with is:

string[] words = Regex.Split(line, @"\s+"); 

and then I use the Contains() method to check each word further and see whether it holds a table entry such as

MY-TEST-TABLE(WS-INDEX) 

If it does, I use Substring to get the two pieces.
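
The approach described in the update can be sketched roughly as follows. This is a Python rendering of the idea, not the asker's actual C# code; the function name `split_words` and the sample line are made up for illustration.

```python
import re

def split_words(line):
    """Split a line on whitespace, then break NAME(SUBSCRIPT) into two words."""
    words = []
    for word in re.split(r"\s+", line.strip()):
        if not word:
            continue  # an empty line yields one empty string
        if "(" in word and word.endswith(")"):
            name, _, subscript = word.partition("(")
            words.append(name)
            words.append(subscript[:-1])  # drop the trailing ")"
        else:
            words.append(word)
    return words

print(split_words("MOVE MY-TEST-TABLE(WS-INDEX) TO ARG-1"))
```

Note that this sketch leaves a trailing period attached to the last word and does not handle a space between the table name and the opening parenthesis.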

Thanks, everyone.

+0

Is it really feasible to use a regular expression instead of a lexer that parses the grammar?

+3

Regular expressions are the wrong tool for this job. You want to find a COBOL parser.

+2

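If you are not good at regular expressions, you should not be trying to build a COBOL analysis tool. You are just not ready. ... You don't say what you want to analyze in the COBOL code; if you want to analyze anything more interesting than "list the identifiers or comments", you need a real COBOL parser. a) Regular expressions cannot be a full COBOL parser, and b) a full COBOL parser is a lot of work.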

Answers

1

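Regular expressions are not the right tool for parsing COBOL syntax, but they can be used to split the input text into tokens. Even for this simpler task, regular expressions alone are not enough; additional logic will be required.

According to the VS COBOL II grammar Version 1.0.4, identifiers (which they call "alphabetic-user-defined-word") are defined as follows: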

([0-9]+[-]*)*[0-9]*[A-Za-z][A-Za-z0-9]*([-]+[A-Za-z0-9]+)*

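This definition is complicated because it ensures that an identifier contains at least one letter. For splitting, this requirement can probably be dropped. If you do that, identifiers are matched by this simpler expression: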

[0-9A-Za-z]+(-[0-9A-Za-z]+)*
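
To see what dropping the "at least one letter" requirement changes, the two expressions can be compared on a few words. The sketch below uses Python's `re` module; both patterns are written as I read them from the text above, so treat their exact form as an assumption, and the sample words are invented.

```python
import re

# Patterns as read from the answer text; their exact form is an assumption.
STRICT = re.compile(r"([0-9]+[-]*)*[0-9]*[A-Za-z][A-Za-z0-9]*([-]+[A-Za-z0-9]+)*")
RELAXED = re.compile(r"[0-9A-Za-z]+(-[0-9A-Za-z]+)*")

for word in ["ARG-1", "MY-TABLE", "1ST-ITEM", "100"]:
    strict_ok = STRICT.fullmatch(word) is not None
    relaxed_ok = RELAXED.fullmatch(word) is not None
    print(f"{word:10} strict={strict_ok} relaxed={relaxed_ok}")
```

Only the purely numeric word is treated differently: the strict pattern rejects it because it contains no letter, while the relaxed pattern accepts it.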

To keep the delimiters when splitting, just put the delimiter into a capturing group (between "(" and ")"):

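string input = "MY-TABLE(WS-INDEX)";
// The outer group keeps each identifier in the result. The inner group is
// non-capturing (?:...) so that Regex.Split does not also emit its captures.
string[] parts = Regex.Split(input, "([0-9A-Za-z]+(?:-[0-9A-Za-z]+)*)");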

The result will be (without the quotes):

""
"MY-TABLE"
"("
"WS-INDEX"
")"
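
For comparison, Python's `re.split` also keeps captured text in its result, so the same idea can be tried there. This is a sketch of the technique, not a translation of anyone's production code. The inner group is written as non-capturing here, because a second capturing group would add its own capture to the output list.

```python
import re

text = "MY-TABLE(WS-INDEX)"

# Outer capturing group: the matched identifiers are kept in the result.
# Inner group is non-capturing, so it contributes nothing extra.
parts = re.split(r"([0-9A-Za-z]+(?:-[0-9A-Za-z]+)*)", text)
print(parts)   # ['', 'MY-TABLE', '(', 'WS-INDEX', ')']

# Drop the empty and whitespace-only pieces to get the tokens.
tokens = [p for p in parts if p.strip()]
print(tokens)  # ['MY-TABLE', '(', 'WS-INDEX', ')']
```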


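Note

Many language grammars have nested structures that are defined recursively. In addition, they have special rules for comments, string escapes and so on, which make parsing very difficult. Regular expressions can parse such structures (see Regular Expression Recursion and Matching Balanced Constructs), but the expression becomes very complex and hard to understand, because you have to squeeze the whole grammar of the language into a single regular expression. It is as if you tried to write a C# application as a single statement. Use specialized tools such as Irony - .NET Language Implementation Kit or Coco/R instead.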
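
The division of labour described in this answer (a regular expression for tokens, ordinary code for nesting) can be illustrated with a very small recursive sketch in Python. The grammar here is a toy of my own (a name optionally followed by a parenthesised list of further references), and the input is hypothetical and not necessarily valid COBOL.

```python
import re

# Tokens: hyphenated words, or a single parenthesis.
TOKEN = re.compile(r"[0-9A-Za-z]+(?:-[0-9A-Za-z]+)*|[()]")

def parse_reference(tokens, pos=0):
    """Parse NAME or NAME ( reference ... ) and return (tree, next_pos)."""
    name = tokens[pos]
    pos += 1
    subscripts = []
    if pos < len(tokens) and tokens[pos] == "(":
        pos += 1
        while tokens[pos] != ")":
            child, pos = parse_reference(tokens, pos)  # recursion handles nesting
            subscripts.append(child)
        pos += 1  # skip the closing ")"
    return (name, subscripts), pos

tokens = TOKEN.findall("MY-TABLE(ROW-IDX(WS-I) COL-IDX)")
tree, _ = parse_reference(tokens)
print(tree)
```

Unbalanced input makes this sketch fail with an IndexError; a real parser would report a proper syntax error instead.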

+1

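Simple regular expressions fail when dealing with column-number constraints and string literals (especially continued string literals), which are problems you have to face when tokenizing the language.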
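
One way to deal with the column constraints before any regular expression runs is to cut each line down to its code area first. The sketch below assumes the traditional fixed reference format (columns 1-6 sequence area, column 7 indicator, columns 8-72 code, columns 73-80 ignored). The function names and the sample source are hypothetical, and continuation lines (indicator "-") are passed through without being joined.

```python
def split_fixed_line(line):
    """Split a fixed-format source line into (indicator, code_area)."""
    padded = line.rstrip("\r\n").ljust(7)  # make sure column 7 exists
    indicator = padded[6]                  # column 7
    code = padded[7:72]                    # columns 8-72
    return indicator, code

def code_lines(source):
    """Yield the code area of every non-comment line."""
    for line in source.splitlines():
        indicator, code = split_fixed_line(line)
        if indicator in "*/":              # comment line
            continue
        yield code.rstrip()

source = (
    "000100 IDENTIFICATION DIVISION.\n"
    "000200* THIS IS A COMMENT\n"
    "000300     MOVE ARG-1 TO MY-TABLE(WS-INDEX).\n"
)
for code in code_lines(source):
    print(repr(code))
```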

2

Jeff A;

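Take a look at the GNU Cobol parsing code at http://sourceforge.net/p/open-cobol/code/HEAD/tree/trunk/gnu-cobol/cobc, starting with, say, http://sourceforge.net/p/open-cobol/code/HEAD/tree/trunk/gnu-cobol/cobc/scanner.l

That directory, in particular the .l lexer files, contains some regular expressions, but in the context of a supporting formal grammar when combined with the .y bison files.

Or, for faster feedback, try the Koopa Cobol Parser at http://koopa.sourceforge.net/, which ships as a .jar.

Or, for a poor man's view, see pygments/lexers/compiled.py in Pygments: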

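import re

from pygments.lexer import RegexLexer, include
from pygments.token import (Comment, Keyword, Name, Number, Operator,
                            Punctuation, String, Text)


class CobolLexer(RegexLexer):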
    """ 
    Lexer for GNU Cobol code. 

    *New in Pygments 1.6.* 
    """ 
    name = 'COBOL' 
    aliases = ['cobol'] 
    filenames = ['*.cob', '*.COB', '*.cpy', '*.CPY'] 
    mimetypes = ['text/x-cobol'] 
    flags = re.IGNORECASE | re.MULTILINE 

    # Data Types: by PICTURE and USAGE 
    # Operators: **, *, +, -, /, <, >, <=, >=, =, <> 
    # Logical (?): NOT, AND, OR 

    # Reserved words: 
    # http://opencobol.add1tocobol.com/gnucobol/#reserved-words 
    # Intrinsics: 
    # http://opencobol.add1tocobol.com/gnucobol/#does-gnu-cobol-implement-any-intrinsic-functions 

    tokens = { 
     'root': [ 
      include('comment'), 
      include('strings'), 
      include('core'), 
      include('nums'), 
      (r'[a-z0-9]([_a-z0-9\-]*[a-z0-9]+)?', Name.Variable), 
    #  (r'[\s]+', Text), 
      (r'[ \t]+', Text), 
     ], 
     'comment': [ 
      (r'(^.{6}[*/].*\n|^.{6}|\*>.*\n)', Comment), 
     ], 
     'core': [ 
      # Figurative constants 
      #(r'(^|(?<=[^0-9a-z_\-]))(ALL\s+)?' 
      (r'\b(?!-)(ALL\s+)?' 
      r'((ZEROES)|(HIGH-VALUE|LOW-VALUE|NULL|QUOTE|SPACE|ZERO)(S)?)' 
      r'\b(?!-)', 
      #r'\s*($|(?=[^0-9a-z_\-]))', 
      Name.Constant), 

      # Reserved words STATEMENTS and other bolds 
      #(r'(^|(?<=[^0-9a-z_\-]))' 
      (r'\b(?!-)' 
      r'(ACCEPT|ADD|ALLOCATE|CALL|CANCEL|CLOSE|COMPUTE|' 
      r'CONFIGURATION|CONTINUE|' 
      r'DATA|DELETE|DISPLAY|DIVIDE|DIVISION|ELSE|END|END-ACCEPT|' 
      r'END-ADD|END-CALL|END-COMPUTE|END-DELETE|END-DISPLAY|' 
      r'END-DIVIDE|END-EVALUATE|END-IF|END-MULTIPLY|END-OF-PAGE|' 
      r'END-PERFORM|END-READ|END-RETURN|END-REWRITE|END-SEARCH|' 
      r'END-START|END-STRING|END-SUBTRACT|END-UNSTRING|END-WRITE|' 
      r'ENVIRONMENT|EVALUATE|EXIT|FD|FILE|FILE-CONTROL|FOREVER|' 
      r'FREE|FUNCTION-ID|GENERATE|GO|GOBACK|' 
      r'IDENTIFICATION|IF|INITIALIZE|' 
      r'INITIATE|INPUT-OUTPUT|INSPECT|INVOKE|I-O-CONTROL|LINKAGE|' 
      r'LOCAL-STORAGE|MERGE|MOVE|MULTIPLY|OPEN|' 
      r'PERFORM|PROCEDURE|PROGRAM-ID|RAISE|READ|RELEASE|RESUME|' 
      r'RETURN|REWRITE|SCREEN|' 
      r'SD|SEARCH|SECTION|SET|SORT|START|STOP|STRING|SUBTRACT|' 
      r'SUPPRESS|TERMINATE|THEN|UNLOCK|UNSTRING|USE|VALIDATE|' 
      r'WORKING-STORAGE|WRITE)' 
      r'\b(?!-)', Keyword.Reserved), 
      #r'\s*($|(?=[^0-9a-z_\-]))', Keyword.Reserved), 

      # Reserved words 
      #(r'(^|(?<=[^0-9a-z_\-]))' 
      (r'\b(?!-)' 
      r'(ACCESS|ADDRESS|ADVANCING|AFTER|ALL|' 
      r'ALPHABET|ALPHABETIC|ALPHABETIC-LOWER|ALPHABETIC-UPPER|' 
      r'ALPHANUMERIC|ALPHANUMERIC-EDITED|ALSO|ALTER|ALTERNATE|' 
      r'ANY|ARE|AREA|AREAS|ARGUMENT-NUMBER|ARGUMENT-VALUE|AS|' 
      r'ASCENDING|ASSIGN|AT|AUTO|AUTO-SKIP|AUTOMATIC|AUTOTERMINATE|' 
      r'BACKGROUND-COLOR|BASED|BEEP|BEFORE|BELL|' 
      r'BLANK|' 
      r'BLINK|BLOCK|BOTTOM|BY|BYTE-LENGTH|CHAINING|' 
      r'CHARACTER|CHARACTERS|CLASS|CODE|CODE-SET|COL|COLLATING|' 
      r'COLS|COLUMN|COLUMNS|COMMA|COMMAND-LINE|COMMIT|COMMON|' 
      r'CONSTANT|CONTAINS|CONTENT|CONTROL|' 
      r'CONTROLS|CONVERTING|COPY|CORR|CORRESPONDING|COUNT|CRT|' 
      r'CURRENCY|CURSOR|CYCLE|DATE|DAY|DAY-OF-WEEK|DE|DEBUGGING|' 
      r'DECIMAL-POINT|DECLARATIVES|DEFAULT|DELIMITED|' 
      r'DELIMITER|DEPENDING|DESCENDING|DETAIL|DISK|' 
      r'DOWN|DUPLICATES|DYNAMIC|EBCDIC|' 
      r'ENTRY|ENVIRONMENT-NAME|ENVIRONMENT-VALUE|EOL|EOP|' 
      r'EOS|ERASE|ERROR|ESCAPE|EXCEPTION|' 
      r'EXCLUSIVE|EXTEND|EXTERNAL|' 
      r'FILE-ID|FILLER|FINAL|FIRST|FIXED|' 
      r'FOOTING|FOR|FOREGROUND-COLOR|FORMAT|FROM|FULL|FUNCTION|' 
      r'GIVING|GLOBAL|GROUP|' 
      r'HEADING|HIGHLIGHT|I-O|ID|' 
      r'IGNORE|IGNORING|IN|INDEX|INDEXED|INDICATE|' 
      r'INITIAL|INITIALIZED|INPUT|' 
      r'INTO|INTRINSIC|INVALID|IS|JUST|JUSTIFIED|' 
      r'KEY|KEYBOARD|LABEL|' 
      r'LAST|LEADING|LEFT|LENGTH|LIMIT|LIMITS|LINAGE|' 
      r'LINAGE-COUNTER|LINE|LINES|LOCALE|LOCK|' 
      r'LOWLIGHT|MANUAL|MEMORY|MINUS|MODE|' 
      r'MULTIPLE|NATIONAL|NATIONAL-EDITED|NATIVE|' 
      r'NEGATIVE|NEXT|NO|NUMBER|NUMBERS|NUMERIC|' 
      r'NUMERIC-EDITED|OBJECT-COMPUTER|OCCURS|OF|OFF|OMITTED|ON|ONLY|' 
      r'OPTIONAL|ORDER|ORGANIZATION|OTHER|OUTPUT|OVERFLOW|' 
      r'OVERLINE|PACKED-DECIMAL|PADDING|PAGE|PARAGRAPH|' 
      r'PLUS|POSITION|POSITIVE|PRESENT|PREVIOUS|' 
      r'PRINTER|PRINTING|PROCEDURES|' 
      r'PROCEED|PROGRAM|PROMPT|QUOTE|' 
      r'QUOTES|RANDOM|RD|RECORD|RECORDING|RECORDS|RECURSIVE|' 
      r'REDEFINES|REEL|REFERENCE|RELATIVE|REMAINDER|REMOVAL|' 
      r'RENAMES|REPLACING|REPORT|REPORTING|REPORTS|REPOSITORY|' 
      r'REQUIRED|RESERVE|RETURNING|REVERSE-VIDEO|REWIND|' 
      r'RIGHT|ROLLBACK|ROUNDED|RUN|SAME|SCROLL|' 
      r'SECURE|SEGMENT-LIMIT|SELECT|SENTENCE|SEPARATE|' 
      r'SEQUENCE|SEQUENTIAL|SHARING|SIGN|SIGNED|SIGNED-INT|' 
      r'SIGNED-LONG|SIGNED-SHORT|SIZE|SORT-MERGE|SOURCE|' 
      r'SOURCE-COMPUTER|SPECIAL-NAMES|STANDARD|' 
      r'STANDARD-1|STANDARD-2|STATUS|SUM|' 
      r'SYMBOLIC|SYNC|SYNCHRONIZED|TALLYING|TAPE|' 
      r'TEST|THROUGH|THRU|TIME|TIMES|TO|TOP|TRAILING|' 
      r'TRANSFORM|TYPE|UNDERLINE|UNIT|UNSIGNED|' 
      r'UNSIGNED-INT|UNSIGNED-LONG|UNSIGNED-SHORT|UNTIL|UP|' 
      r'UPDATE|UPON|USAGE|USING|VALUE|VALUES|VARYING|WAIT|WHEN|' 
      r'WITH|WORDS|YYYYDDD|YYYYMMDD)' 
      r'\b(?!-)', Keyword.Pseudo), 
      #r'\s*($|(?=[^0-9a-z_\-]))', Keyword.Pseudo), 

      # inactive reserved words 
      #(r'(^|(?<=[^0-9a-z_\-]))' 
      (r'\b(?!-)' 
      r'(ACTIVE-CLASS|ALIGNED|ANYCASE|ARITHMETIC|ATTRIBUTE|B-AND|' 
      r'B-NOT|B-OR|B-XOR|BIT|BOOLEAN|CD|CENTER|CF|CH|CHAIN|CLASS-ID|' 
      r'CLASSIFICATION|COMMUNICATION|CONDITION|DATA-POINTER|' 
      r'DESTINATION|DISABLE|EC|EGI|EMI|ENABLE|END-RECEIVE|' 
      r'ENTRY-CONVENTION|EO|ESI|EXCEPTION-OBJECT|EXPANDS|FACTORY|' 
      r'FLOAT-BINARY-16|FLOAT-BINARY-34|FLOAT-BINARY-7|' 
      r'FORMAT|' 
      r'GET|GROUP-USAGE|IMPLEMENTS|INFINITY|' 
      r'INHERITS|INTERFACE|INTERFACE-ID|INVOKE|LC_ALL|LC_COLLATE|' 
      r'LC_CTYPE|LC_MESSAGES|LC_MONETARY|LC_NUMERIC|LC_TIME|' 
      r'LINE-COUNTER|MESSAGE|METHOD|METHOD-ID|NESTED|NONE|NORMAL|' 
      r'OBJECT|OBJECT-REFERENCE|OPTIONS|OVERRIDE|PAGE-COUNTER|PF|PH|' 
      r'PROPERTY|PROTOTYPE|PURGE|QUEUE|RAISE|RAISING|RECEIVE|' 
      r'RELATION|REPLACE|REPRESENTS-NOT-A-NUMBER|RESET|RESUME|RETRY|' 
      r'RF|RH|SECONDS|SEGMENT|SELF|SEND|SOURCES|STATEMENT|STEP|' 
      r'STRONG|SUB-QUEUE-1|SUB-QUEUE-2|SUB-QUEUE-3|SUPER|SYMBOL|' 
      r'SYSTEM-DEFAULT|TABLE|TERMINAL|TEXT|TYPEDEF|UCS-4|UNIVERSAL|' 
      r'USER-DEFAULT|UTF-16|UTF-8|VAL-STATUS|VALID|VALIDATE|' 
      r'VALIDATE-STATUS)\b(?!-)', Comment), 
      #r'VALIDATE-STATUS)\s*($|(?=[^0-9a-z_\-]))', Comment), 

      # Data Types 
      (r'(^|(?<=[^0-9a-z_\-]))' 
      #(r'\b(?!-)' 
      r'(PIC\s+.+?(?=(\s|\.\s))|PICTURE\s+.+?(?=(\s|\.\s))|' 
      r'(COMPUTATIONAL)(-[1-5X])?|(COMP)(-[1-5X])?|' 
      r'BINARY-C-LONG|POINTER|PROGRAM-POINTER|' 
      r'FUNCTION-POINTER|PROCEDURE-POINTER|' 
      r'BINARY-CHAR|BINARY-DOUBLE|BINARY-LONG|BINARY-SHORT|' 
      r'FLOAT-SHORT|FLOAT-LONG|FLOAT-DECIMAL-16|FLOAT-DECIMAL-34|' 
      r'FLOAT-BINARY-32|FLOAT-BINARY-64|FLOAT-BINARY-128|' 
      r'FLOAT-EXTENDED|FLOAT-DECIMAL-7|' 
      # r'BINARY)\b(?!-)', Keyword.Type), 
      r'BINARY)\s*($|(?=[^0-9a-z_\-]))', Keyword.Type), 

      # Operators 
      (r'(\*\*|\*|\+|-|/|<=|>=|<|>|==|/=|=)', Operator), 

      # (r'(::)', Keyword.Declaration), 

      (r'([(),;:&%.])', Punctuation), 

      # Intrinsics 
      #(r'(^|(?<=[^0-9a-z_\-]))(ABS|ACOS|ANNUITY|ASIN|ATAN|BYTE-LENGTH|' 
      (r'\b(?!-)(ABS|ACOS|ANNUITY|ASIN|ATAN|BYTE-LENGTH|' 
      r'CHAR|COMBINED-DATETIME|CONCATENATE|COS|CURRENT-DATE|' 
      r'DATE-OF-INTEGER|DATE-TO-YYYYMMDD|DAY-OF-INTEGER|DAY-TO-YYYYDDD|' 
      r'EXCEPTION-(?:FILE|LOCATION|STATEMENT|STATUS)|EXP10|EXP|E|' 
      r'FACTORIAL|FRACTION-PART|INTEGER-OF-(?:DATE|DAY|PART)|INTEGER|' 
      r'LENGTH|LOCALE-(?:DATE|TIME(?:-FROM-SECONDS)?)|LOG10|LOG|' 
      r'LOWER-CASE|MAX|MEAN|MEDIAN|MIDRANGE|MIN|MOD|NUMVAL(?:-C)?|' 
      r'ORD(?:-MAX|-MIN)?|PI|PRESENT-VALUE|RANDOM|RANGE|REM|REVERSE|' 
      r'SECONDS-FROM-FORMATTED-TIME|SECONDS-PAST-MIDNIGHT|SIGN|SIN|SQRT|' 
      r'STANDARD-DEVIATION|STORED-CHAR-LENGTH|SUBSTITUTE(?:-CASE)?|' 
      r'SUM|TAN|TEST-DATE-YYYYMMDD|TEST-DAY-YYYYDDD|TRIM|' 
      r'UPPER-CASE|VARIANCE|WHEN-COMPILED|YEAR-TO-YYYY)' 
      r'\b(?!-)', Name.Function), 
      #r'UPPER-CASE|VARIANCE|WHEN-COMPILED|YEAR-TO-YYYY)\s*' 
      #r'($|(?=[^0-9a-z_\-]))', Name.Function), 

      # Booleans 
      #(r'(^|(?<=[^0-9a-z_\-]))(true|false)\s*($|(?=[^0-9a-z_\-]))', Name.Builtin), 
      (r'\b(?!-)(true|false)\b(?!-)', Name.Builtin), 
      # Comparing Operators 
      #(r'(^|(?<=[^0-9a-z_\-]))(equal|equals|ne|lt|le|gt|ge|' 
      # r'greater|less|than|not|and|or)\s*($|(?=[^0-9a-z_\-]))', Operator.Word), 
      (r'\b(?!-)(equal|equals|ne|lt|le|gt|ge|' 
      r'greater|less|than|not|and|or)\b(?!-)', Operator.Word), 
     ], 

     # \"[^\"\n]*\"|\'[^\'\n]*\' 
     'strings': [ 
      # apparently strings can be delimited by EOL if they are continued 
      # in the next line 
      (r'"[^"\n]*("|\n)', String.Double), 
      (r"'[^'\n]*('|\n)", String.Single), 
     ], 

     'nums': [ 
      #(r'\d+(\s+|\.$|$)', Number.Integer), 
      (r'\b(?!-)\d+\b(?!-)', Number.Integer), 
      (r'[+-]?\d*\.\d+([eE][-+]?\d+)?', Number.Float), 
      (r'[+-]?\d+\.\d*([eE][-+]?\d+)?', Number.Float), 
     ], 
    } 


class CobolFreeformatLexer(CobolLexer): 
    """ 
    Lexer for Free format OpenCOBOL code. 

    *New in Pygments 1.6.* 
    """ 
    name = 'COBOLFree' 
    aliases = ['cobolfree'] 
    filenames = ['*.cbl', '*.CBL'] 
    mimetypes = [] 
    flags = re.IGNORECASE | re.MULTILINE 

    tokens = { 
     'comment': [ 
      (r'(\*>.*\n|^\w*\*.*$)', Comment), 
     ], 
    } 

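Excuse the dead-code comments. This is the Pygments COBOL syntax highlighter from bitbucket.org with a lot of the backtracking pattern matching removed; it is still being tested and has not yet been committed to bitbucket. It only produces pretty colors in source listings, with no intelligence or correctness.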
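
The token table above depends on rule order: at each position the rules are tried from top to bottom and the first match wins, which is why keywords are listed before the generic name pattern. A much-reduced stand-in using only Python's `re` module can show that idea. The rule list, the labels and the `lex` function below are hypothetical simplifications, not Pygments code.

```python
import re

# Hypothetical, much-reduced rule table; order matters, as in the lexer above.
RULES = [
    (re.compile(r"\*>.*"), "Comment"),
    (re.compile(r'"[^"\n]*"'), "String"),
    (re.compile(r"\b(?!-)(END-IF|IF|MOVE|TO)\b(?!-)", re.IGNORECASE), "Keyword"),
    (re.compile(r"\b(?!-)\d+\b(?!-)"), "Number"),
    (re.compile(r"[a-z0-9]([_a-z0-9\-]*[a-z0-9]+)?", re.IGNORECASE), "Name"),
    (re.compile(r"[(),.]"), "Punctuation"),
    (re.compile(r"\s+"), "Text"),
]

def lex(text):
    """Yield (label, value) pairs, trying the rules in order at each position."""
    pos = 0
    while pos < len(text):
        for pattern, label in RULES:
            m = pattern.match(text, pos)
            if m:
                yield label, m.group()
                pos = m.end()
                break
        else:
            yield "Error", text[pos]  # no rule matched this character
            pos += 1

for label, value in lex("IF IF-FLAG MOVE 1 TO MY-TABLE(WS-INDEX)"):
    if label != "Text":
        print(label, value)
```

The trailing `\b(?!-)` guard is what stops the keyword rule from claiming the IF in IF-FLAG, so the whole word falls through to the name rule.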