2013-11-01 24 views
0

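How to capture the complement of a grouped regular expression in Python

I want to detect C multi-line comments with a regular expression (the re module in Python).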

So it should be able to find

/* this is my 
first comment it also has a * in it. 
Now I end my first comment */ 
int a = 3; 

/* this is my second 
multiline comment */ 

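So I need to find these two multi-line comments using re. I want to do re.findall(r'exp', string). What should the expression be? I tried taking the complement of a group of characters, as in r'\(\*[^(?:\*\))]*\*\)', basically grouping *) and matching its complement. But this does not work.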

+0

Why do you need a regular expression? Using the 'find()' or 'partition()' methods of 'string' is easier.

+0

Hi @Steve, I am writing a tokenizer that will extract comments as tokens. I don't think find and partition are powerful enough for that. – noPE

Answers

1

One possible way:

import re 

ccode = '''/* this is my 
first comment it also has a * in it. 
Now I end my first comment */ 
int a = 3; 

/* this is my second 
multiline comment */''' 

# non-greedy .*? stops at the first */; re.DOTALL lets . match newlines
for comment in re.findall(r'/[*].*?[*]/', ccode, re.DOTALL):
    print(comment)

This gives:

/* this is my 
first comment it also has a * in it. 
Now I end my first comment */ 
/* this is my second 
multiline comment */ 
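
As for the "complement of a group" the question asks about: a character class such as [^...] only negates single characters, so it cannot express "anything except the two-character sequence */". One way to write that explicitly is a negative lookahead applied before each character. The following is a minimal sketch of that alternative, not a replacement for the non-greedy pattern above; the variable names are only for illustration.

```python
import re

ccode = '''/* this is my
first comment it also has a * in it.
Now I end my first comment */
int a = 3;

/* this is my second
multiline comment */'''

# (?!\*/). consumes one character only if "*/" does not start at this position,
# so the repeated group matches "anything that is not the terminator".
pattern = re.compile(r'/\*(?:(?!\*/).)*\*/', re.DOTALL)

comments = pattern.findall(ccode)
for comment in comments:
    print(comment)
```

Both patterns find the same two comments here; the lookahead form just makes the "not */" condition explicit.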

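That said, if you are building a parser, it is probably better to extract tokens in a lexer first and to define the multi-token constructs in the parser rules.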
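
A rough sketch of what "extract tokens in a lexer first" could look like with only the re module. The token names (COMMENT, LINE_COMMENT, STRING, CHAR, OTHER) and the helper functions are assumptions made for this example, and the rule set is far from a complete C lexer. The point is that string and character literals are consumed as tokens of their own, so a /* inside them is never tried as the start of a comment.

```python
import re

# Hypothetical token set for illustration; a real C lexer needs many more rules.
TOKEN_RE = re.compile(r'''
    (?P<COMMENT>/\*.*?\*/)              # block comment
  | (?P<LINE_COMMENT>//[^\n]*)          # line comment
  | (?P<STRING>"(?:\\.|[^"\\\n])*")     # string literal with backslash escapes
  | (?P<CHAR>'(?:\\.|[^'\\\n])*')       # character literal
  | (?P<OTHER>.)                        # any other single character
''', re.DOTALL | re.VERBOSE)


def tokenize(source):
    """Yield (kind, text) pairs for every token found in source."""
    for match in TOKEN_RE.finditer(source):
        yield match.lastgroup, match.group()


def block_comments(source):
    """Return only the /* ... */ comments that are real comments."""
    return [text for kind, text in tokenize(source) if kind == 'COMMENT']


sample = 'const char *s = "/* not a comment */"; /* real comment */'
print(block_comments(sample))
```

Because the alternatives are tried at each position from left to right, the opening quote of the string is reached before the /* inside it, and the whole literal is consumed as a STRING token.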

+0

Thanks perreal, this works. I also checked the example implementation included with ply. r'/\*(.|\n)*?\*/' works the same way :) – noPE

+0

What about this case: /* func("*/abc"); */ ;-) – Artur

+0

@Artur, right, that is why using or writing a parser is the better idea. – perreal

0

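This is not doable with regular expressions alone. You have to distinguish cases such as the following, for example with a state machine:

  • trigraphs
  • continued lines
  • /* can appear inside a string literal, where it does not open a comment
  • a / may appear inside a string, where it is simply part of the string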

You will not get this right with a regular expression. Only with a state machine.
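
A small demonstration of the string-literal case from the list above. The C snippet is made up for this example; the pattern is a plain non-greedy block comment regex that knows nothing about strings.

```python
import re

# A block comment pattern with no knowledge of string literals.
naive = re.compile(r'/\*.*?\*/', re.DOTALL)

# Hypothetical C source: the first /* ... */ is inside a string literal.
c_source = 'const char *s = "/* just text */"; /* a real comment */'

found = naive.findall(c_source)
print(found)
```

The pattern reports two comments, although only the second one is a comment as far as a C compiler is concerned.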

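I know you want Python, but I did something similar to what you want in Perl back in the day, so here you go. Go ahead and convert it to Python. It may not be the fastest or the best, but it is good enough: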

###################################################################################### 
#### Before going any further perform all 4 stages of preprocessing 
#### described here http://gcc.gnu.org/onlinedocs/cpp/Initial-processing.html 
############################# 1 - break file into lines ############################## 

use feature 'state'; # the "state" variables used below need Perl 5.10 or later

open FILE, $file or die "file [$file] was not found\n";
my @lines = <FILE>; # deletes \r from every line(\n stays on place) 
close FILE; 

################################ 2 - handle trigraphs ################################ 
foreach (@lines) 
{ 
    s!\Q??=\E!#!g; #??= becomes # 
    s#\Q??/\E#\\#g; #??/ becomes \ 
    s#\Q??'\E#^#g; #??' becomes ^
    s#\Q??(\E#[#g; #??( becomes [
    s#\Q??)\E#]#g; #??) becomes ] 
    s#\Q??!\E#|#g; #??! becomes | 
    s#\Q??<\E#{#g; #??< becomes { 
    s#\Q??>\E#}#g; #??> becomes } 
    s#\Q??-\E#~#g; #??- becomes ~ 
} 

################################ 3 - merge continued lines ########################### 
# everything in C/C++ may be spanned across many lines so we must merge continued 
# lines to handle things correctly 
# we do not delete lines that are merged with the preceding line - we just leave an
# empty line to preserve the overall location of everything, which will be needed later
# to properly report line numbers if we find something that we are interested in

for (my $i = 0; $i <= $#lines; $i++) 
{ 
    # shows where continued line started ie. where to append following continued line(s) 
    state $appendHere; # acts also as an "append indicator" 
    my $continuedLine; 

    # theoretically continued line ends with \ but preprocessors accept \ followed by 
    # one or more whitespaces too so we accept it as well 
    if ($lines[$i] =~ m#\\[ \t\v\f]*$#) # merge with next line/continued line ? 
    { 
     $lines[$i] =~ s#\\[ \t\v\f]*$##; # delete \ with trailing whitespaces if any 
     $continuedLine = 1; 
    } 
    else 
    { 
     $continuedLine = 0; 
    } 

    if (!defined $appendHere) 
    { 
     if ($continuedLine == 1) 
     { 
      # we will append continued lines to $lines[$appendHere] 
      $appendHere = $i; 
     } 
    } 
    else 
    { 
     chomp $lines[$appendHere];    # get rid of \n before appending next 
     chomp $lines[$i];      # get rid of \n before appending next 
     $lines[$appendHere] .= "$lines[$i]\n"; # append current line to previously marked location 
     $lines[$i] = "\n";      # leave only \n in the current line since we want to preserve line numbers 

     if ($continuedLine == 0) # merge next line too? 
     { 
      $appendHere = undef; 
     } 
    } 
} 

#printFileFormatted(); 

######################## 4 - handle comments and strings ###################################### 
# similarly substituting a comment body with a single space may spoil our line numbers so 
# we are just replacing comments with spaces preserving newlines where necessary 

my $state = "out"; 
my $error; 
my $COMMENT_SUBST = ' '; #'@'; 
my $STRING_SUBST = ' '; #'%'; 

ERROR: for (my $line = 0; $line <= $#lines; $line++) 
{ 
    state $hexVal = 0; 
    state $octVal = 0; 
    state $string = ""; 

    my @chars = split //, $lines[$line]; 
    my $newLine = ""; 

    for (my $i = 0; $i <= $#chars; $i++) 
    { 
     my $c = $chars[$i]; 

     if ($state eq 'out') # ---------------------------------------------- 
     { 
      if ($c eq '/') 
      { 
       $state = 'comment?'; 
       $newLine .= $c; 
      } 
      elsif ($c eq '"') 
      { 
       $state = 'string char'; 
       $newLine .= $STRING_SUBST; 
      } 
      else 
      { 
       $newLine .= $c; 
      } 
     } 
     elsif ($state eq 'comment?') # ---------------------------------------------- 
     { 
      if ($c eq '/') 
      { 
       $state = '//comment'; 
       chop $newLine; 
       $newLine .= $COMMENT_SUBST x 2; 
      } 
      elsif ($c eq '*') 
      { 
       $state = '/*comment'; 
       chop $newLine; 
       $newLine .= $COMMENT_SUBST x 2; 
      } 
      else 
      { 
       $state = 'out'; 
       $newLine .= $c; 
      } 
     } 
     elsif ($state eq '//comment') # ---------------------------------------------- 
     { 
      if ($c eq "\n") 
      { 
       $state = 'out'; 
       $newLine .= $c; 
      } 
      else 
      { 
       $newLine .= $COMMENT_SUBST; 
      } 
     } 
     elsif ($state eq '/*comment') # ---------------------------------------------- 
     { 
      if ($c eq '*') 
      { 
       $state = '/*comment end?'; 
       $newLine .= $COMMENT_SUBST; 
      } 
      elsif ($c eq "\n") 
      { 
       $newLine .= $c; 
      } 
      else 
      { 
       $newLine .= $COMMENT_SUBST; 
      } 
     } 
     elsif ($state eq '/*comment end?') # ---------------------------------------------- 
     { 
      if ($c eq '*') 
      { 
       $newLine .= $COMMENT_SUBST; 
      } 
      elsif ($c eq "\n") 
      { 
       $newLine .= $c; 
      } 
      elsif ($c eq '/') 
      { 
       $state = 'out'; 
       $newLine .= $COMMENT_SUBST; 
      } 
      else 
      { 
       $state = '/*comment'; 
       $newLine .= $COMMENT_SUBST; 
      } 
     } 
     elsif ($state eq 'string char') # ---------------------------------------------- 
     { 
      # theoretically ignore "everything" within a string 
      # which may look like "abc\\" = abc\ or "abc\"" = abc" 
      # "abc\" - wrong - no end of string, "abc\\\" wrong again 

      # in order to detect if particular " terminates a string we have to check the whole string 
      # since it cannot be determined just by checking what the previous character was hence 
      # that state machine was created 

      if ($c eq '"') 
      { 
       $state = 'out'; 
       $newLine .= $STRING_SUBST; 
      } 
      elsif ($c eq "\\") 
      { 
       $state = 'string esc seq'; 
       $newLine .= $STRING_SUBST; 
      } 
      elsif ($c eq "\n") 
      { 
       $error = "line [".($line+1)."] - error - a newline within a string\n"; 
       last ERROR; 
      } 
      else 
      { 
       $newLine .= $STRING_SUBST; 
      } 
     } 
     elsif ($state eq 'string esc seq') # ---------------------------------------------- 
     { 
      # simple esc seq \' \" \? \\ \a \b \f \n \r \t \v 
      # oct num  \o \oo \ooo no more than 3 oct digits (o=[0-7]{1,3}) but value must be < than 255 
      # hex num  \xh \xhh \xhhh..... unlimited number of hex digits (h=[0-9a-fA-F]+) but value must be < than 255 

      # in any other esc seq \ will be ignored hence \u=u \p=p \k=k etc 

      if ($c =~ m#^['"\?\\abfnrtv]$#) # the simple escapes listed above (\n, not \h)
      { 
       $state = 'string char'; 
       $newLine .= $STRING_SUBST x 2; 
      } 
      elsif ($c eq 'x') 
      { 
       $state = 'string hex marker'; 
       $newLine .= $STRING_SUBST; 
      } 
      elsif ($c =~ m#^[0-7]$#) 
      { 
       $state = 'string oct'; 
       $octVal = oct($c); 
       $newLine .= $STRING_SUBST; 
      } 
      elsif ($c eq "\n") 
      { 
       $error = "line [".($line+1)."] - error - a newline within a string\n"; 
       last ERROR; 
      } 
      else # other esc sequences are ignored - usually a warning is issued
      { 
       $state = 'string char'; 
       $newLine .= $STRING_SUBST x 2; 
      } 
     } 
     elsif ($state eq 'string hex marker') # ---------------------------------------------- 
     { 
      if ($c =~ m#^[0-9a-fA-F]$#) 
      { 
       $state = 'string hex'; 
       $hexVal = hex($c); 
       $newLine .= $STRING_SUBST; 
      } 
      else 
      { 
       $error = "line [".($line+1)."] - error - hex escape sequence not finished\n"; 
       last ERROR; 
      } 
     } 
     elsif ($state eq 'string hex') # ---------------------------------------------- 
     { 
      if ($c =~ m#^[0-9a-fA-F]$#) 
      { 
       $hexVal <<= 4; 
       $hexVal += hex($c); 

       # treat as regular 8bit character sequence - no fancy long chars etc 
       if ($hexVal > 255) 
       { 
        $error = "line [".($line+1)."] - error - hex escape sequence too big for a character\n"; 
        last ERROR; 
       } 

       $newLine .= $STRING_SUBST; 
      } 
      elsif ($c eq '"') 
      { 
       $state = 'out'; 
       $newLine .= $STRING_SUBST; 
       $hexVal = 0; 
      } 
      elsif ($c eq "\n") 
      { 
       $error = "line [".($line+1)."] - error - a newline within a string\n"; 
       last ERROR; 
      } 
      else 
      { 
       $state = 'string char'; 
       $newLine .= $STRING_SUBST; 
       $hexVal = 0; 
      } 
     } 
     elsif ($state eq 'string oct') # ---------------------------------------------- 
     { 
      if ($c =~ m#^[0-7]$#) 
      { 
       $octVal <<= 3; 
       $octVal += oct($c); 

       # treat as regular 8bit character sequence - no fancy long chars etc 
       if ($octVal > 255) 
       { 
        $error = "line [".($line+1)."] - error - oct esc sequence too big for a character\n"; 
        last ERROR; 
       } 

       $newLine .= $STRING_SUBST; 
      } 
      elsif ($c eq "\n") 
      { 
       $error = "line [".($line+1)."] - error - a newline within a string\n"; 
       last ERROR; 
      } 
      elsif ($c eq '"') 
      { 
       $state = 'out'; 
       $newLine .= $STRING_SUBST; 
       $octVal = 0; 
      } 
      else 
      { 
       $state = 'string char'; 
       $newLine .= $STRING_SUBST; 
       $octVal = 0; 
      } 
     } 
     else 
     { 
      $error = "line [".($line+1)."] - error - state machine problem - unknown state\n"; 
      last ERROR; 
     } 

    }#for (my $i = 0; $i <= $#chars; $i++) 

    $lines[ $line ] = $newLine; 
}#for (my $line = 0; $line <= $#lines; $line++) 

if ($error) # errors detected within state machine? 
{ 
    print "$error"; 
    exit(1); 
} 
else # EOF met - check the state 
{ 
    if ($state eq 'out') 
    { 
     # ok no problem 
    } 
    elsif ($state eq 'comment?') 
    { 
     # ok no problem - may be a division after all - not a preproc problem 
    } 
    elsif ($state eq '//comment') 
    { 
     # ok no problem 
    } 
    elsif ($state eq '/*comment') 
    { 
     print "EOF reached within /* */ comment\n"; 
     exit(1); 
    } 
    elsif ($state eq '/*comment end?') 
    { 
     print "EOF reached within /* */ comment\n"; 
     exit(1); 
    } 
    elsif ($state eq 'string char') 
    { 
     print "EOF reached within string\n"; 
     exit(1); 
    } 
    elsif ($state eq 'string esc seq') 
    { 
     print "EOF reached within string\n"; 
     exit(1); 
    } 
    elsif ($state eq 'string hex marker') 
    { 
     print "EOF reached within string\n"; 
     exit(1); 
    } 
    elsif ($state eq 'string hex') 
    { 
     print "EOF reached within string\n"; 
     exit(1); 
    } 
    elsif ($state eq 'string oct') 
    { 
     print "EOF reached within string\n"; 
     exit(1); 
    } 
    else 
    { 
     print "EOF reached and state machine is in unknown state\n"; 
     exit(1); 
    } 
} 
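
For the "convert it to Python" part, here is a condensed sketch of stage 4 only (comments and strings). It is an approximation of the Perl above under some stated assumptions, not a line-by-line port: the function name and the single subst parameter are invented for this example, hex and octal escape values are not range-checked, every input character produces exactly one output character, and a quote right after a lone / is treated as the start of a string. Like the Perl version it assumes trigraphs and continued lines were already handled, and it does not treat character literals specially.

```python
def blank_comments_and_strings(source, subst=' '):
    """Replace comment and string characters with subst, keeping newlines."""
    out = []
    state = 'out'
    for c in source:
        if state == 'out':
            if c == '/':
                state = 'comment?'
                out.append(c)
            elif c == '"':
                state = 'string'
                out.append(subst)
            else:
                out.append(c)
        elif state == 'comment?':
            if c in '/*':
                state = '//comment' if c == '/' else '/*comment'
                out[-1] = subst          # blank the '/' emitted one step earlier
                out.append(subst)
            elif c == '"':
                state = 'string'
                out.append(subst)
            else:
                state = 'out'            # it was a division sign after all
                out.append(c)
        elif state == '//comment':
            if c == '\n':
                state = 'out'
            out.append(c if c == '\n' else subst)
        elif state == '/*comment':
            if c == '*':
                state = '/*comment end?'
            out.append(c if c == '\n' else subst)
        elif state == '/*comment end?':
            if c == '/':
                state = 'out'
            elif c != '*':
                state = '/*comment'
            out.append(c if c == '\n' else subst)
        else:                            # 'string' or 'string escape'
            if c == '\n':
                raise ValueError('newline within a string')
            if state == 'string escape':
                state = 'string'         # an escaped character never ends the string
            elif c == '"':
                state = 'out'
            elif c == '\\':
                state = 'string escape'
            out.append(subst)
    if state in ('/*comment', '/*comment end?'):
        raise ValueError('EOF reached within /* */ comment')
    if state in ('string', 'string escape'):
        raise ValueError('EOF reached within string')
    return ''.join(out)


sample = 'int a = 3; /* c1 *\n still */ char *s = "/* no */"; // tail\nint b;'
print(blank_comments_and_strings(sample))
```

Since the output has the same length and the same newlines as the input, positions found in the blanked text map directly back to the original source.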
+0

Hi Artur, I just took this from an example implementation of ANSI C. Is r'/\*(.|\n)*?\*/' OK? :) – noPE

+0

@noPe: this regex will only match a small fraction of the cases one may find in C/C++ files. Try this one: /* func("*/abc"); */ – Artur

-1

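If you are writing a tokenizer, you will already include a check for strings, so your pattern will not be tried on a comment that sits inside a string. In that case this pattern will work for you: "(/[*][\S\s]*?[*]/)"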
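
A quick check of that pattern against the sample from the question. The class [\S\s] matches any character, newlines included, so no re.DOTALL flag is needed; the comparison with the flag-based pattern is added here only for illustration.

```python
import re

ccode = '''/* this is my
first comment it also has a * in it.
Now I end my first comment */
int a = 3;

/* this is my second
multiline comment */'''

# With a single capturing group, findall returns the text of that group,
# which here is the whole comment.
comments = re.findall(r"(/[*][\S\s]*?[*]/)", ccode)

# Same result as the non-greedy pattern that relies on re.DOTALL.
same_as_dotall = comments == re.findall(r'/[*].*?[*]/', ccode, re.DOTALL)
print(len(comments), same_as_dotall)
```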