Parsing a Python setup.py file with ANTLR

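I have a Java program that has to parse Python setup.py files and extract information from them. I have something partly working, but I have hit a wall. I am starting with a simple, stripped-down file; once that works, I will worry about filtering out the noise I don't want, so that the grammar can handle a real file.

So, here is my grammar: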

grammar SetupPy ; 

file_input: (NEWLINE | setupDeclaration)* EOF; 

setupDeclaration : 'setup' '(' method ')'; 
method : setupRequires testRequires; 
setupRequires : 'setup_requires' '=' '[' LISTVAL* ']' COMMA; 
testRequires : 'tests_require' '=' '[' LISTVAL* ']' COMMA; 

WS: [ \t\n\r]+ -> skip ; 
COMMA : ',' -> skip ; 
LISTVAL : SHORT_STRING ; 

UNKNOWN_CHAR 
: . 
; 

fragment SHORT_STRING 
: '\'' (STRING_ESCAPE_SEQ | ~[\\\r\n\f'])* '\'' 
| '"' (STRING_ESCAPE_SEQ | ~[\\\r\n\f"])* '"' 
; 

/// stringescapeseq ::= "\" <any source character> 
fragment STRING_ESCAPE_SEQ 
: '\\' . 
| '\\' NEWLINE 
; 

fragment SPACES 
: [ \t]+ 
; 

NEWLINE 
: ({atStartOfInput()}? SPACES 
    | ('\r'? '\n' | '\r' | '\f') SPACES? 
    ) 
    { 
    String newLine = getText().replaceAll("[^\r\n\f]+", ""); 
    String spaces = getText().replaceAll("[\r\n\f]+", ""); 
    int next = _input.LA(1); 
    if (opened > 0 || next == '\r' || next == '\n' || next == '\f' || next == '#') { 
     // If we're inside a list or on a blank line, ignore all indents, 
     // dedents and line breaks. 
     skip(); 
    } 
    else { 
     emit(commonToken(NEWLINE, newLine)); 
     int indent = getIndentationCount(spaces); 
     int previous = indents.isEmpty() ? 0 : indents.peek(); 
     if (indent == previous) { 
     // skip indents of the same size as the present indent-size 
     skip(); 
     } 
     else if (indent > previous) { 
     indents.push(indent); 
     emit(commonToken(Python3Parser.INDENT, spaces)); 
     } 
     else { 
     // Possibly emit more than 1 DEDENT token. 
     while(!indents.isEmpty() && indents.peek() > indent) { 
      this.emit(createDedent()); 
      indents.pop(); 
     } 
     } 
    } 
    } 
;

And here is my current test file (as I said, stripping the noise out of a normal file is the next step):

setup(
    setup_requires=['pytest-runner'], 
    tests_require=['pytest', 'unittest2'], 
)

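Where I am stuck is how to tell ANTLR that setup_requires and tests_require contain arrays. I want the values of those arrays regardless of whether someone uses single quotes, double quotes, one value per line, or any combination of the above. I have no idea how to pull this off. Can I get some help, maybe an example or two?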

Things to note:

No, I cannot use Jython and just run the file.
Regex is not an option either, because developers' styles for this file vary too much.

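Of course, once this issue is solved I still need to figure out how to strip the noise from a normal file and cope with the huge variations. I tried to do this with the Python3 grammar, but I am a novice at ANTLR and it blew me away. I could not figure out how to write the rules to pull out the values, so I decided to try a much simpler grammar, and quickly hit another wall.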

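Edit: here is an actual setup.py file that it will eventually have to parse. Keep in mind that setup_requires and tests_require may or may not be present, and may or may not appear in this order.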

# -*- coding: utf-8 -*- 
from __future__ import with_statement 

from setuptools import setup 


def get_version(fname='mccabe.py'): 
    with open(fname) as f: 
     for line in f: 
      if line.startswith('__version__'): 
       return eval(line.split('=')[-1]) 


def get_long_description(): 
    descr = [] 
    for fname in ('README.rst',): 
     with open(fname) as f: 
      descr.append(f.read()) 
    return '\n\n'.join(descr) 


setup(
    name='mccabe', 
    version=get_version(), 
    description="McCabe checker, plugin for flake8", 
    long_description=get_long_description(), 
    keywords='flake8 mccabe', 
    author='Tarek Ziade', 
    author_email='[email protected]', 
    maintainer='Ian Cordasco', 
    maintainer_email='[email protected]', 
    url='https://github.com/pycqa/mccabe', 
    license='Expat license', 
    py_modules=['mccabe'], 
    zip_safe=False, 
    setup_requires=['pytest-runner'], 
    tests_require=['pytest'], 
    entry_points={ 
     'flake8.extension': [ 
      'C90 = mccabe:McCabeChecker', 
     ], 
    }, 
    classifiers=[ 
     'Development Status :: 5 - Production/Stable', 
     'Environment :: Console', 
     'Intended Audience :: Developers', 
     'License :: OSI Approved :: MIT License', 
     'Operating System :: OS Independent', 
     'Programming Language :: Python', 
     'Programming Language :: Python :: 2', 
     'Programming Language :: Python :: 2.7', 
     'Programming Language :: Python :: 3', 
     'Programming Language :: Python :: 3.3', 
     'Programming Language :: Python :: 3.4', 
     'Programming Language :: Python :: 3.5', 
     'Programming Language :: Python :: 3.6', 
     'Topic :: Software Development :: Libraries :: Python Modules', 
     'Topic :: Software Development :: Quality Assurance', 
    ], 
)

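While trying to debug and simplify, I realized that I don't need to find the method, just the values. So I am playing with this grammar: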

grammar SetupPy ; 

file_input: (ignore setupRequires ignore | ignore testRequires ignore)* EOF; 

setupRequires : 'setup_requires' '=' '[' dependencyValue* (',' dependencyValue)* ']'; 
testRequires : 'tests_require' '=' '[' dependencyValue* (',' dependencyValue)* ']'; 

dependencyValue: LISTVAL; 

ignore : UNKNOWN_CHAR? ; 

LISTVAL: SHORT_STRING; 
UNKNOWN_CHAR: . -> channel(HIDDEN); 

fragment SHORT_STRING: '\'' (STRING_ESCAPE_SEQ | ~[\\\r\n\f'])* '\'' 
| '"' (STRING_ESCAPE_SEQ | ~[\\\r\n\f"])* '"'; 

fragment STRING_ESCAPE_SEQ 
: '\\' . 
| '\\' 
;

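It works great on the simple file and even handles the out-of-order problem. But it does not work on the full file; it gets hung up on

def get_version(fname='mccabe.py'):

at the equals sign in that line.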

Asked by scphantm, 2017-07-16

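Did you get a chance to evaluate my solution? – TomServo

I finally got to it. Unfortunately it breaks on an actual file. It picks up the import statements and goes all wonky. I did post an example of an actual file that needs to be parsed. I will keep playing with this a bit before I give up and solve it in a less elegant way. I am running out of time. – scphantm

Yes, parsing that is a bit much, but your UNKNOWN_CHAR token is problematic. Nearly everything that is not an implicitly defined lexer token ends up relying on this one rule. – TomServo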
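The "less elegant" fallback mentioned in the comments could look roughly like the sketch below. It is not ANTLR and not the approach of either poster: it is a small hand-written scan in Java (no regular expressions) that skips comments and unrelated strings, looks for the keyword as a whole identifier followed by `=` and `[`, and collects the string literals up to the closing `]`. The class name `RequiresScanner` and its methods are hypothetical. It assumes the value is a literal list of ordinary single- or double-quoted strings; a variable on the right-hand side, or triple-quoted strings containing quote characters elsewhere in the file, are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ANTLR: a hand-written scan for `keyword=[...]`.
class RequiresScanner {

    /** Returns the string literals inside `keyword=[...]`, or an empty list if not found. */
    static List<String> listFor(String src, String keyword) {
        List<String> values = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '#') {                                   // comment: skip to end of line
                i = endOfLine(src, i);
            } else if (c == '\'' || c == '"') {               // unrelated string: skip it whole
                int end = endOfString(src, i);
                if (end < 0) break;                           // unterminated literal: give up
                i = end;
            } else if (Character.isLetter(c) || c == '_') {   // identifier
                int start = i;
                while (i < src.length()
                        && (Character.isLetterOrDigit(src.charAt(i)) || src.charAt(i) == '_')) i++;
                if (!src.substring(start, i).equals(keyword)) continue;
                int j = skipSpace(src, i);
                if (j >= src.length() || src.charAt(j) != '=') continue;
                j = skipSpace(src, j + 1);
                if (j >= src.length() || src.charAt(j) != '[') continue;
                j++;
                while (j < src.length() && src.charAt(j) != ']') {
                    char d = src.charAt(j);
                    if (d == '\'' || d == '"') {
                        int end = endOfString(src, j);
                        if (end < 0) return values;
                        values.add(src.substring(j + 1, end - 1)); // raw text between the quotes
                        j = end;
                    } else if (d == '#') {
                        j = endOfLine(src, j);                // comment inside the list
                    } else {
                        j++;                                  // whitespace, commas, prefixes
                    }
                }
                return values;
            } else {
                i++;
            }
        }
        return values;
    }

    /** Index just past the closing quote of the literal starting at start, or -1 if unterminated. */
    static int endOfString(String src, int start) {
        char quote = src.charAt(start);
        for (int i = start + 1; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '\\') i++;                               // skip the escaped character
            else if (c == quote) return i + 1;
        }
        return -1;
    }

    static int endOfLine(String s, int i) {
        while (i < s.length() && s.charAt(i) != '\n') i++;
        return i;
    }

    static int skipSpace(String s, int i) {
        while (i < s.length() && Character.isWhitespace(s.charAt(i))) i++;
        return i;
    }

    public static void main(String[] args) {
        String src = "def get_version(fname='mccabe.py'):\n    pass\n"
                + "setup(\n    tests_require=['pytest',\n    \"unittest2\"],\n)\n";
        System.out.println(listFor(src, "tests_require"));
    }
}
```

Because the keyword must be a whole identifier directly followed by `=` and `[`, the `fname='mccabe.py'` default argument that trips the grammar is simply skipped. The trade-off is that this sketch understands far less Python than a real grammar does.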

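I examined your grammar and simplified it quite a bit. I took out all the Python-esque whitespace handling and just treated whitespace as whitespace. This grammar also parses the following input, which, as you asked in the question, covers one item per line, single and double quotes, and so on: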

setup(
    setup_requires=['pytest-runner'], 
    tests_require=['pytest', 
    'unittest2', 
    "test_3" ], 
)

And here is the much-simplified grammar:

grammar SetupPy ; 
setupDeclaration : 'setup' '(' method ')' EOF; 
method : setupRequires testRequires ; 
setupRequires : 'setup_requires' '=' '[' LISTVAL* (',' LISTVAL)* ']' ',' ; 
testRequires : 'tests_require' '=' '[' LISTVAL* (',' LISTVAL)* ']' ',' ; 
WS: [ \t\n\r]+ -> skip ; 
LISTVAL : SHORT_STRING ; 
fragment SHORT_STRING 
: '\'' (STRING_ESCAPE_SEQ | ~[\\\r\n\f'])* '\'' 
| '"' (STRING_ESCAPE_SEQ | ~[\\\r\n\f"])* '"' 
; 
fragment STRING_ESCAPE_SEQ 
: '\\' . 
| '\\' 
;

Oh, and here is the lexer output showing that the tokens are assigned correctly:

[@0,0:4='setup',<'setup'>,1:0] 
[@1,5:5='(',<'('>,1:5] 
[@2,12:25='setup_requires',<'setup_requires'>,2:4] 
[@3,26:26='=',<'='>,2:18] 
[@4,27:27='[',<'['>,2:19] 
[@5,28:42=''pytest-runner'',<LISTVAL>,2:20] 
[@6,43:43=']',<']'>,2:35] 
[@7,44:44=',',<','>,2:36] 
[@8,51:63='tests_require',<'tests_require'>,3:4] 
[@9,64:64='=',<'='>,3:17] 
[@10,65:65='[',<'['>,3:18] 
[@11,66:73=''pytest'',<LISTVAL>,3:19] 
[@12,74:74=',',<','>,3:27] 
[@13,79:89=''unittest2'',<LISTVAL>,4:1] 
[@14,90:90=',',<','>,4:12] 
[@15,95:102='"test_3"',<LISTVAL>,5:1] 
[@16,104:104=']',<']'>,5:10] 
[@17,105:105=',',<','>,5:11] 
[@18,108:108=')',<')'>,6:0] 
[@19,109:108='<EOF>',<EOF>,6:1]

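Now you should be able to follow the simple ANTLR visitor or listener pattern to grab your LISTVAL tokens and do your thing with them. I hope this meets your needs. It certainly parses your test input well, and more.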
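One detail when grabbing the tokens: as the token dump shows, the text of a LISTVAL token still includes its quotes ('pytest', "test_3"). The following is a minimal sketch in Java of a helper that strips them. The class `ListValText` and its methods are hypothetical and not part of ANTLR. In a listener or visitor you would pass it the text of each LISTVAL terminal; that wiring is left out here because it depends on the classes ANTLR generates from the grammar. As a simplification, a backslash escape is resolved by keeping the escaped character literally, which is right for \' and \" but does not translate sequences such as \n.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of ANTLR: cleans up the text of LISTVAL tokens.
class ListValText {

    /** Strips the surrounding quotes from one LISTVAL token text. */
    static String unquote(String tokenText) {
        if (tokenText.length() < 2) {
            return tokenText;                       // too short to be a quoted literal
        }
        char quote = tokenText.charAt(0);
        int last = tokenText.length() - 1;
        boolean quoted = (quote == '\'' || quote == '"') && tokenText.charAt(last) == quote;
        if (!quoted) {
            return tokenText;                       // leave anything unexpected untouched
        }
        StringBuilder out = new StringBuilder();
        for (int i = 1; i < last; i++) {
            char c = tokenText.charAt(i);
            if (c == '\\' && i + 1 < last) {
                i++;                                // simplification: keep the escaped
                c = tokenText.charAt(i);            // character as-is (\' becomes ')
            }
            out.append(c);
        }
        return out.toString();
    }

    /** Applies unquote to every token text, preserving order. */
    static List<String> unquoteAll(List<String> tokenTexts) {
        List<String> values = new ArrayList<>();
        for (String text : tokenTexts) {
            values.add(unquote(text));
        }
        return values;
    }

    public static void main(String[] args) {
        // Token texts as they appear in the lexer output above.
        List<String> texts = Arrays.asList("'pytest'", "'unittest2'", "\"test_3\"");
        System.out.println(unquoteAll(texts));      // prints [pytest, unittest2, test_3]
    }
}
```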

Answered by TomServo, 2017-07-16 21:39:43

Maybe this is worth an upvote too? Thanks, we all know how hard rep is to come by in these slow tags. :) – TomServo
