Python snippet to remove C and C++ comments

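I'm looking for Python code that removes C and C++ comments from a string. (Assume the string contains an entire C source file.)

I realize that I could .match() substrings with a regex, but that doesn't solve nested /*, or having a // inside a /* */.

Ideally, I would prefer a non-naive implementation that properly handles awkward cases.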

2008-10-27 TomZ

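Why on earth would you want to *remove* comments from source code? – QuantumPete 2008-10-28 08:09:44

@QuantumPete, to improve readability and comprehensibility. The quickest approach is to use a colorizing editor and set the comment color equal to the background color. – 2009-05-23 00:12:54

@QuantumPete Or because we are trying to preprocess source code for a subsequent processor that does not accept normal comments. – 2017-02-06 03:27:32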

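I don't know if you're familiar with sed, the UNIX-based (but Windows-available) text parsing program, but I've found a sed script, remccoms3.sed, which will remove C/C++ comments from a file. It's very smart; for example, it will ignore '//' and '/*' if found in a string declaration, etc. From within Python, it can be used with the following code: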

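import subprocess

# source_code is a string with the source code.
# sed reads its commands from a script file only when given -f; if the
# script's own first line names further options, pass those as well.
process = subprocess.Popen(
    ['sed', '-f', '/path/to/remccoms3.sed'],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
stripped_code, _ = process.communicate(source_code)
return_code = process.returncode

In this program, source_code is the variable holding the C/C++ source code, and stripped_code ends up holding the C/C++ code with the comments removed. Of course, if you have the files on disk, you could pass open file handles as stdin and stdout instead (the input file opened for reading, the output file for writing). remccoms3.sed is the sed script mentioned above and should be saved in a readable location on disk. sed is also available on Windows, and is installed by default on most GNU/Linux distributions and on Mac OS X.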

This will probably be better than a pure Python solution; no need to reinvent the wheel.

2008-10-28 04:03:20 zvoase

C (and C++) comments cannot be nested. Regular expressions work well:

//.*?\n|/\*.*?\*/

This requires the "single line" flag (re.S) because a C comment can span multiple lines.

import re

def stripcomments(text):
    return re.sub(r'//.*?\n|/\*.*?\*/', '', text, flags=re.S)

This code should work.

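/Edit: Notice that my code above actually makes an assumption about line endings! It won't work on a Mac text file. However, this can be amended relatively easily: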

//.*?(\r\n?|\n)|/\*.*?\*/

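This regular expression should work on all text files, regardless of their line endings (it covers Windows, Unix and Mac line endings).

/Edit: MizardX and Brian (in the comments) made a valid remark about the handling of strings. I completely forgot about that because the above regex is plucked from a parsing module that has additional handling for strings. MizardX's solution should work very well, but it only handles double-quoted strings.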
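
To make the string problem concrete, here is a small check of the amended pattern against a line like the one Brian gives in the comments. The function name and the sample text are made up for illustration; this is only a demonstration of the failure, not a fix.

```python
import re

# The amended pattern from this answer; it knows nothing about string literals.
NAIVE = re.compile(r'//.*?(\r\n?|\n)|/\*.*?\*/', re.S)

def strip_comments_naive(text):
    return NAIVE.sub('', text)

sample = 'string uncPath = "//some_path";\nint x; // real comment\n'
result = strip_comments_naive(sample)
# The "//" inside the string is treated as a comment start, so the rest of
# that line (including the closing quote and the newline) is swallowed.
print(repr(result))
```

The string literal is cut in half and the two source lines are merged, which is why any serious pattern has to match string literals as well and copy them through untouched.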

2008-10-27 20:48:57

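1. Use `$` and re.MULTILINE instead of `'\n'`, `'\r\n'`, etc. – jfs 2008-10-27 21:46:39

This doesn't handle the case of a line ending in a backslash, which indicates a continued line, but that case is extremely rare. – 2008-10-27 22:00:39

You've missed the replacement blank string in the re.sub. Also, this won't work for strings. E.g. consider 'string uncPath = "//some_path";' or 'char operators[]="/*+-";'. For language parsing, I think you're best off using a real parser. – Brian 2008-10-27 22:01:33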

You may be able to leverage py++ to parse the C++ source with GCC.

Py++ does not reinvent the wheel. It uses GCC C++ compiler to parse C++ source files. To be more precise, the tool chain looks like this:

1. source code is passed to GCC-XML
2. GCC-XML passes it to GCC C++ compiler
3. GCC-XML generates an XML description of a C++ program from GCC's internal representation.
4. Py++ uses pygccxml package to read GCC-XML generated file.

The bottom line - you can be sure, that all your declarations are read correctly.

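Or maybe not. Regardless, this is not a trivial parse.

@ RE based solutions - you are unlikely to find an RE that handles all possible "awkward" cases correctly, unless you constrain the input (e.g. no macros). For a bulletproof solution, you really have no choice but to leverage the real grammar.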

2008-10-27 20:50:45

This handles C++-style comments, C-style comments, strings and simple nesting thereof.

import re

def comment_remover(text):
    def replacer(match):
        s = match.group(0)
        if s.startswith('/'):
            return " "  # note: a space and not an empty string
        else:
            return s
    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )
    return re.sub(pattern, replacer, text)

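Strings need to be included, because comment markers inside them do not start a comment.

Edit: re.sub didn't take any flags, so I had to compile the pattern first.

Edit2: Added character literals, since they could contain quotes that would otherwise be recognized as string delimiters.

Edit3: Fixed the case where a legal expression int/**/x=5; would become intx=5; which would not compile, by replacing the comment with a space rather than an empty string.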
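
The three edits can be checked with a handful of inputs. The sketch below restates the same pattern in a compact form so that it runs on its own; the test strings are invented for illustration.

```python
import re

PATTERN = re.compile(
    r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
    re.DOTALL | re.MULTILINE,
)

def comment_remover(text):
    # Comments start with '/'; string and character literals are returned as-is.
    return PATTERN.sub(
        lambda m: " " if m.group(0).startswith('/') else m.group(0), text)

cases = [
    'char *p = "// not a comment";',   # comment marker inside a string
    "char q = '\"'; // real comment",   # double quote inside a character literal
    'int/**/x=5;',                      # comment used as a token separator
]
for c in cases:
    print(repr(comment_remover(c)))
```

The first case comes back unchanged, the second loses only the trailing comment, and the third becomes int x=5; rather than intx=5;.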

2008-10-27 21:48:07

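Don't forget that in C, backslash-newline is eliminated before comments are processed, and trigraphs are processed before that (because ??/ is the trigraph for backslash). I have a C program called SCC (strip C/C++ comments), and here is part of the test code...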

" */ /* SCC has been trained to know about strings /* */ */"! 
"\"Double quotes embedded in strings, \\\" too\'!" 
"And \ 
newlines in them" 

"And escaped double quotes at the end of a string\"" 

aa '\\ 
n' OK 
aa "\"" 
aa "\ 
\n" 

This is followed by C++/C99 comment number 1. 
// C++/C99 comment with \ 
continuation character \ 
on three source lines (this should not be seen with the -C fla 
The C++/C99 comment number 1 has finished. 

This is followed by C++/C99 comment number 2. 
/\ 
/\ 
C++/C99 comment (this should not be seen with the -C flag) 
The C++/C99 comment number 2 has finished. 

This is followed by regular C comment number 1. 
/\ 
*\ 
Regular 
comment 
*\ 
/
The regular C comment number 1 has finished. 

/\ 
\/ This is not a C++/C99 comment! 

This is followed by C++/C99 comment number 3. 
/\ 
\ 
\ 
/But this is a C++/C99 comment! 
The C++/C99 comment number 3 has finished. 

/\ 
\* This is not a C or C++ comment! 

This is followed by regular C comment number 2. 
/\ 
*/ This is a regular C comment *\ 
but this is just a routine continuation *\ 
and that was not the end either - but this is *\ 
\ 
/
The regular C comment number 2 has finished. 

This is followed by regular C comment number 3. 
/\ 
\ 
\ 
\ 
* C comment */

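This does not illustrate trigraphs. Note that you can have multiple backslashes at the end of a line; the line splicing doesn't care how many there are, but the subsequent processing might. Etc. Writing a single regex to handle all these cases will be non-trivial (but that is different from impossible).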
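
One way to cope with the splicing cases from Python is to imitate those early translation phases before looking for comments. The sketch below is only an approximation under stated assumptions: splice_lines and strip_comments_after_splicing are hypothetical names, only the ??/ trigraph is handled, and the comment pattern is the one from the regex answer in this thread, not SCC's logic.

```python
import re

def splice_lines(text):
    """Rough stand-in for the first two C translation phases."""
    # Phase 1 (partial): only the ??/ trigraph, which stands for a backslash.
    text = text.replace('??/', '\\')
    # Phase 2: delete each backslash that is immediately followed by a newline.
    return re.sub(r'\\(\r\n?|\n)', '', text)

def strip_comments_after_splicing(text):
    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE,
    )
    # Comments become a single space; literals are copied through unchanged.
    return pattern.sub(
        lambda m: ' ' if m.group(0).startswith('/') else m.group(0),
        splice_lines(text))

sample = 'a = 1; /\\\n/ spliced C++ comment \\\nstill the comment\nb = 2;\n'
print(strip_comments_after_splicing(sample))
```

The price is that the output is the spliced text: continued lines in the remaining code are joined as well, so line numbers no longer match the original file.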

2008-10-28 02:57:06

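Regular expressions will fall down in some cases, such as where a string literal contains a subsequence that matches the comment syntax. You really need a parse tree to deal with this.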

2008-10-28 02:58:24

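You don't really need a parse tree to do this perfectly, but you do in effect need a token stream equivalent to what the compiler's front end produces. Such a token stream must take care of all the weirdness, such as a line-continued comment start, a comment start inside a string, trigraph normalization, etc. If you have the token stream, deleting the comments is easy. (I have a tool that produces exactly such token streams as, guess what, the front end of a real parser that produces a real parse tree :).

The fact that the tokens are individually recognized by regular expressions suggests that you can, in principle, write a regular expression that will pick out the comment lexemes. The real complexity of the set of regular expressions for the tokenizer (at least the ones we wrote) suggests you can't do this in practice; writing them individually was hard enough. So, if you don't want to do it perfectly, then most of the RE solutions above are just fine.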
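
As a middle ground between one big regex and a full tokenizer, a small hand-written scanner can track which kind of lexeme it is inside. This is only a sketch (strip_comments_scanner is a made-up name, and it is not the tool mentioned above): it assumes trigraphs and backslash-newline splices have already been dealt with, and it does not know about C++ raw string literals or stray apostrophes in preprocessor text.

```python
def strip_comments_scanner(text):
    """Single pass over the text, tracking what kind of lexeme we are inside."""
    out = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        nxt = text[i + 1] if i + 1 < n else ''
        if ch in '"\'':
            # String or character literal: copy through the closing quote,
            # treating each backslash escape as a two-character unit.
            j = i + 1
            while j < n and text[j] != ch:
                j += 2 if text[j] == '\\' else 1
            j = min(j + 1, n)
            out.append(text[i:j])
            i = j
        elif ch == '/' and nxt == '/':
            # Line comment: skip up to, but not including, the newline.
            j = text.find('\n', i)
            i = n if j == -1 else j
        elif ch == '/' and nxt == '*':
            # Block comment: skip through the closing */ and leave one space
            # so that the tokens on either side stay separated.
            j = text.find('*/', i + 2)
            i = n if j == -1 else j + 2
            out.append(' ')
        else:
            out.append(ch)
            i += 1
    return ''.join(out)

code = 'int/**/x = 5; // c\nchar *s = "/* not */ a // comment";\n'
print(strip_comments_scanner(code))
```

Each branch corresponds to one lexeme class, so adding a new case (raw strings, say) means adding a branch rather than growing a single pattern.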

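Now, why you would want to strip comments is beyond me, unless you are building a code obfuscator. In that case, you have to get it perfectly right.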

2009-07-03 08:38:46

Sorry, this is not a Python solution, but you could also use a tool that understands how to remove comments, like your C/C++ preprocessor. Here's how GNU CPP does it.

cpp -fpreprocessed foo.c

2009-07-03 09:08:24 sigjuice

Another non-Python answer: use the program stripcmt:

StripCmt is a simple utility written in C to remove comments from C, C++, and Java source files. In the grand tradition of Unix text processing programs, it can function either as a FIFO (First In - First Out) filter or accept arguments on the commandline.

2009-08-18 14:18:07 hlovdal

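I recently ran into this problem when I took a class where the professor required us to strip javadoc from our source code before submitting it to him for a code review. We had to do this several times, but we couldn't just remove the javadoc permanently because we were also required to generate javadoc html files. Here is a little Python script I made to do the trick. Since javadoc starts with /** and ends with */, the script looks for these tokens, but it can be modified to suit your needs. It also handles single-line block comments and cases where a block comment ends but there is still non-commented code on the same line as the end of the block comment. I hope this helps!

WARNING: This script modifies the contents of the files passed in and saves them to the original files. It would be wise to have a backup somewhere else.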

#!/usr/bin/python
"""
A simple script to remove block comments of the form /** */ from files
Use example: ./strip_comments.py *.java
Author: holdtotherod
Created: 3/6/11
"""
import sys
import fileinput

for file in sys.argv[1:]:
    inBlockComment = False
    for line in fileinput.input(file, inplace=1):
        if "/**" in line:
            inBlockComment = True
        if inBlockComment and "*/" in line:
            inBlockComment = False
            # If the */ isn't last, remove through the */
            if line.find("*/") != len(line) - 3:
                line = line[line.find("*/")+2:]
            else:
                continue
        if inBlockComment:
            continue
        sys.stdout.write(line)

2011-03-07 16:10:54 slottermoser

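This post provides a coded-out version of the improvement to Markus Jarderot's code that was described by atikat in a comment on Markus Jarderot's post. (Thanks to both for providing the original code, which saved me a lot of work.)

To describe the improvement somewhat more fully: it keeps the line numbering intact. (This is done by keeping the newline characters in the strings that replace the C/C++ comments.)

This version of the C/C++ comment removal function is suitable when you want to generate error messages for your users (e.g. parsing errors) that contain line numbers (i.e. line numbers valid for the original text).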

import re

def removeCCppComment(text):

    def blotOutNonNewlines(strIn):  # Return a string containing only the newline chars contained in strIn
        return "" + ("\n" * strIn.count('\n'))

    def replacer(match):
        s = match.group(0)
        if s.startswith('/'):  # Matched string is //...EOL or /*...*/ ==> Blot out all non-newline chars
            return blotOutNonNewlines(s)
        else:                  # Matched string is '...' or "..." ==> Keep unchanged
            return s

    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )

    return re.sub(pattern, replacer, text)
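
One thing to watch: a comment that contains no newline is replaced by an empty string here, so int/**/x=5; turns into intx=5; again. A possible variant, which is my own sketch and not part of the post above (remove_comments_keep_lines is a made-up name), leaves a single space in that case and keeps the newlines otherwise:

```python
import re

_PATTERN = re.compile(
    r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
    re.DOTALL | re.MULTILINE,
)

def remove_comments_keep_lines(text):
    def replacer(match):
        s = match.group(0)
        if not s.startswith('/'):
            return s  # string or character literal: keep unchanged
        newlines = s.count('\n')
        # Keep the newlines so line numbers still match the original; when
        # the comment sits on one line, leave a space as a token separator.
        return '\n' * newlines if newlines else ' '
    return _PATTERN.sub(replacer, text)

source = 'int/**/x = 5;\n/* spans\n   two lines */\nint y; // trailing\n'
stripped = remove_comments_keep_lines(source)
print(stripped.count('\n') == source.count('\n'))
```

The number of newlines is the same before and after, so an error reported against the stripped text still points at the right line of the original.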

2013-08-14 14:36:25

The following worked for me:

from subprocess import check_output

class Util:
    def strip_comments(self, source_code):
        # source_code is the path of the file to preprocess.
        return check_output(['cpp', '-fpreprocessed', source_code],
                            shell=False, text=True)

if __name__ == "__main__":
    util = Util()
    print(util.strip_comments("somefile.ext"))

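This is a combination of subprocess and the cpp preprocessor. For my project I have a utility class called "Util" where I keep the various tools I use/need.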

2013-09-25 05:27:45
