2013-01-05 97 views

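I have a set of 500-600 files that I want to search and extract data from. I have been trying to use pyparsing, with very limited success. The files contain only three kinds of things: (1) comments, (2) simple assignments, and (3) nested assignments. The nesting is about 6 levels deep.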

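My goal is to look at a particular field three levels deep and, if it holds a specific value, pull a value from another third-level field that belongs to the same second-level field.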

First, is pyparsing the appropriate tool for this? If not, what would you suggest instead?

I know how to build a list of files and iterate over them. Let me show a sample file, and then the code I am trying.

# TOP_OBJECT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
TOP_OBJECT= 
(
    obj_fmt= 
    (
    obj_name="foo" 
    obj_cre_date=737785182 # = Tue May 18 23:19:42 1993 
    opj_data= 
    (
     a="continue" 
     b="quit" 
    ) 
    obj_version=264192 # = Version 4.8.0 
    ) 

# LEVEL1_OBJECT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
    LEVEL1_OBJECT= 
    (
     OBJ_part= 
     (
     obj_type=1005 
     obj_size=120 
     ) 

# LEVEL2_OBJECT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
     LEVEL2_OBJECT_A= 
     (
      OBJ_part= 
      (
      obj_type=3001 
      obj_size=128 
      ) 

      Another_part= 
      (
       another_attr= 
       (
       another_style=0 
       another_param=2 
       ) 
      ) 
     ) ### End of LEVEL2_OBJECT_A ### 
# LEVEL2_OBJECT ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
     LEVEL2_OBJECT_B= 
     (
      OBJ_part= 
      (
      obj_type=3005 
      obj_size=128 
      ) 

      Another_part= 
      (
       another_attr= 
       (
       another_style=0 
       another_param=8 
       ) 
      ) 
     ) ### End of LEVEL2_OBJECT_B ### 
    ) ### End of LEVEL1 OBJECT 
) ### End of TOP_OBJECT ### 

My code to digest the file looks like this:

from pyparsing import * 

def Syntax(): 
    comment = Group("#" + restOfLine).suppress() 
    eq = Literal('=') 
    lpar = Literal('(').suppress() 
    rpar = Literal(')').suppress() 
    num = Word(nums) 
    var = Word(alphas + "_") 
    simpleAssign = var + eq 
    nestedAssign = Group(lpar + OneOrMore(simpleAssign) + rpar) 
    expr = Forward() 
    atom = nestedAssign | simpleAssign 
    expr << atom 
    expr.ignore(comment) 
    return expr 

def main(): 

    expr = Syntax() 
    results = expr.parseFile("for_show.asc") 
    print(results)

if __name__ == '__main__': 
    main() 

My results are disappointing: [ 'TOP_OBJECT', '=']

For now I am not handling quoted strings or numbers; I am just trying to understand how to parse the nested lists.

Answer


You're pretty close; there are just a few gaps in your parser. See the commented-out original lines next to the current code:

def Syntax(): 
    comment = Group("#" + restOfLine).suppress() 
    eq = Literal('=') 
    lpar = Literal('(').suppress() 
    rpar = Literal(')').suppress() 
    num = Word(nums) 
    #~ var = Word(alphas + "_") 
    var = Word(alphas + "_", alphanums+"_") 
    #~ simpleAssign = var + eq 
    expr = Forward() 
    simpleAssign = var + eq + (num | quotedString) 
    #~ nestedAssign = Group(lpar + OneOrMore(simpleAssign) + rpar) 
    nestedAssign = var + eq + Group(lpar + OneOrMore(expr) + rpar) 
    atom = nestedAssign | simpleAssign 
    expr << atom 
    expr.ignore(comment) 
    return expr 

This gives:

['TOP_OBJECT', 
'=', 
['obj_fmt', 
    '=', 
    ['obj_name', 
    '=', 
    '"foo"', 
    'obj_cre_date', 
    '=', 
    '737785182', 
    'opj_data', 
    '=', 
    ['a', '=', '"continue"', 'b', '=', '"quit"'], 
    'obj_version', 
    '=', 
    '264192'], 
    'LEVEL1_OBJECT', 
    '=', 
    ['OBJ_part', 
    '=', 
    ['obj_type', '=', '1005', 'obj_size', '=', '120'], 
    'LEVEL2_OBJECT_A', 
    '=', 
    ['OBJ_part', 
    '=', 
    ['obj_type', '=', '3001', 'obj_size', '=', '128'], 
    'Another_part', 
    '=', 
    ['another_attr', 
    '=', 
    ['another_style', '=', '0', 'another_param', '=', '2']]], 
    'LEVEL2_OBJECT_B', 
    '=', 
    ['OBJ_part', 
    '=', 
    ['obj_type', '=', '3005', 'obj_size', '=', '128'], 
    'Another_part', 
    '=', 
    ['another_attr', 
    '=', 
    ['another_style', '=', '0', 'another_param', '=', '8']]]]]] 
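One of the fixes above is easy to miss: the original var only accepted letters and underscores, so a name such as LEVEL1_OBJECT would stop matching at the digit. The snippet below is only a rough regex analogue of the two Word definitions (the regexes and the names old_var and new_var are my own illustration, not pyparsing code), but it shows the difference:

```python
import re

# Rough regex analogues of the two pyparsing Word definitions (illustrative only).
old_var = re.compile(r'[A-Za-z_]+')               # like Word(alphas + "_")
new_var = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')   # like Word(alphas + "_", alphanums + "_")

name = 'LEVEL1_OBJECT'
print(old_var.match(name).group())   # LEVEL
print(new_var.match(name).group())   # LEVEL1_OBJECT
```

With the old definition the parser would read LEVEL, then expect an = and find 1 instead, so the assignment would fail to match.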

If you wrap the expr inside nestedAssign's OneOrMore in a Group, as in

nestedAssign = var + eq + Group(lpar + OneOrMore(Group(expr)) + rpar) 

then I think you will get a better structure for your repeated nested assignments:

['TOP_OBJECT', 
'=', 
[['obj_fmt', 
    '=', 
    [['obj_name', '=', '"foo"'], 
    ['obj_cre_date', '=', '737785182'], 
    ['opj_data', '=', [['a', '=', '"continue"'], ['b', '=', '"quit"']]], 
    ['obj_version', '=', '264192']]], 
    ['LEVEL1_OBJECT', 
    '=', 
    [['OBJ_part', 
    '=', 
    [['obj_type', '=', '1005'], ['obj_size', '=', '120']]], 
    ['LEVEL2_OBJECT_A', 
    '=', 
    [['OBJ_part', 
     '=', 
     [['obj_type', '=', '3001'], ['obj_size', '=', '128']]], 
     ['Another_part', 
     '=', 
     [['another_attr', 
     '=', 
     [['another_style', '=', '0'], ['another_param', '=', '2']]]]]]], 
    ['LEVEL2_OBJECT_B', 
    '=', 
    [['OBJ_part', 
     '=', 
     [['obj_type', '=', '3005'], ['obj_size', '=', '128']]], 
     ['Another_part', 
     '=', 
     [['another_attr', 
     '=', 
     [['another_style', '=', '0'], ['another_param', '=', '8']]]]]]]]]]] 
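Once every assignment is its own [name, '=', value] group, the lookup described in the question becomes fairly mechanical. Here is a minimal sketch of one way to do it, assuming the parse results have been converted to plain lists (for example with results.asList()) and that names are unique within each group. The helper names to_dict and find_params are made up for this example, the nested list is an abbreviated hand-copied stand-in for the real output, and the choice of obj_type as the field to test and another_param as the field to extract is only a guess at what you are after:

```python
def to_dict(assignments):
    """Turn a list of [name, '=', value] triples into a nested dict."""
    result = {}
    for name, _eq, value in assignments:
        # A list value is a nested group of further assignments.
        result[name] = to_dict(value) if isinstance(value, list) else value
    return result


def find_params(tree, wanted_type):
    """For each second-level object whose OBJ_part.obj_type equals
    wanted_type, collect its another_param value (hypothetical criteria)."""
    hits = {}
    for name, obj in tree['TOP_OBJECT']['LEVEL1_OBJECT'].items():
        if obj.get('OBJ_part', {}).get('obj_type') == wanted_type:
            hits[name] = obj['Another_part']['another_attr']['another_param']
    return hits


# Abbreviated stand-in for results.asList() from the grouped grammar.
parsed = ['TOP_OBJECT', '=', [
    ['LEVEL1_OBJECT', '=', [
        ['OBJ_part', '=', [['obj_type', '=', '1005'], ['obj_size', '=', '120']]],
        ['LEVEL2_OBJECT_A', '=', [
            ['OBJ_part', '=', [['obj_type', '=', '3001'], ['obj_size', '=', '128']]],
            ['Another_part', '=', [
                ['another_attr', '=', [['another_style', '=', '0'], ['another_param', '=', '2']]],
            ]],
        ]],
        ['LEVEL2_OBJECT_B', '=', [
            ['OBJ_part', '=', [['obj_type', '=', '3005'], ['obj_size', '=', '128']]],
            ['Another_part', '=', [
                ['another_attr', '=', [['another_style', '=', '0'], ['another_param', '=', '8']]],
            ]],
        ]],
    ]],
]]

# The top level is a single bare triple, so wrap it in a list first.
tree = to_dict([parsed])
print(find_params(tree, '3005'))   # {'LEVEL2_OBJECT_B': '8'}
```

Note that all values are still strings at this point (and quoted strings keep their quotes), so compare against '3005' rather than 3005 unless you convert them first.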

Also, your originally posted code contained tabs. I find they are more trouble than they are worth; it is better to use 4-space indentation.


Thanks for the quick response and for cleaning up my fumbling. I will make the suggested changes. – user1625344