2016-04-10 24 views
2

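pyparsing: recursive list of values (IBM Rhapsody)

I am building a parser for the IBM Rhapsody sbs file format. Unfortunately, the recursive part does not work as expected. The rule pp.Word(pp.printables + " ") is probably the problem, because it also matches ;, { and }. But ; at least can also be part of a value.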

import pyparsing as pp 
import pprint 


TEST = r"""{ foo 
    - key = bla; 
    - value = 1243; 1233; 1235; 
    - _hans = "hammer 
    time"; 
    - HaMer = 765; 786; 890; 
    - value = " 
    #pragma LINK_INFO DERIVATIVE \"mc9s12xs256\" 
     "; 
    - _mText = 12.11.2015::13:20:0; 
    - value = "war"; "fist"; 
    - _obacht = "fish,car,button"; 
    - _id = gibml c0d8-4535-898f-968362779e07; 
    - bam = { boing 
     - key = bla; 
    } 
    { boing 
     - key = bla; 
    } 
} 
""" 


def flat(loc, toks): 
    if len(toks[0]) == 1: 
     return toks[0][0] 

assignment = pp.Suppress("-") + pp.Word(pp.alphanums + "_") + pp.Suppress("=") 

value = pp.OneOrMore(
    pp.Group(assignment + (
     pp.Group(pp.OneOrMore(
      pp.QuotedString('"', escChar="\\", multiline=True) + 
      pp.Suppress(";"))).setParseAction(flat) | 
     pp.Word(pp.alphas) + pp.Suppress(";") | 
     pp.Word(pp.printables + " ") 
    )) 
) 

expr = pp.Forward() 

expr = pp.Suppress("{") + pp.Word(pp.alphas) + (
    value | (assignment + expr) | expr 
) + pp.Suppress("}") 
expr = expr.ignore(pp.pythonStyleComment) 


print(TEST)
pprint.pprint(expr.parseString(TEST).asList()) 

Output:

% python prase.py              
{ foo 
    - key = bla; 
    - value = 1243; 1233; 1235; 
    - _hans = "hammer 
    time"; 
    - HaMer = 765; 786; 890; 
    - value = " 
    #pragma LINK_INFO DERIVATIVE \"mc9s12xs256\" 
     "; 
    - _mText = 12.11.2015::13:20:0; 
    - value = "war"; "fist"; 
    - _obacht = "fish,car,button"; 
    - _id = gibml c0d8-4535-898f-968362779e07; 
    - bam = { boing 
     - key = bla; 
    } 
    { boing 
     - key = bla; 
    } 
} 

['foo', 
['key', 'bla'], 
['value', '1243; 1233; 1235;'], 
['_hans', 'hammer\n time'], 
['HaMer', '765; 786; 890;'], 
['value', '\n #pragma LINK_INFO DERIVATIVE "mc9s12xs256"\n  '], 
['_mText', '12.11.2015::13:20:0;'], 
['value', ['war', 'fist']], 
['_obacht', 'fish,car,button'], 
['_id', 'gibml c0d8-4535-898f-968362779e07;'], 
['bam', '{ boing'], 
['key', 'bla']] 

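Is there a typo in TEST? Should the last group after '- bam = { boing etc. }' be '- something = { boing \n - key = bla; }'? It is hard to see what this format is supposed to be, and you have various OneOrMore's thrown in here and there. I think things will be clearer if you stop and write a BNF first. – PaulMcG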


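Also, I strongly recommend against expressions that match too much, like 'pp.Word(printables + " ")'. Read up on recent versions of pyparsing's Word class, which accepts an 'excludeChars' argument, so that if you really need something like "any printable except ';'", you can write 'Word(printables, excludeChars=";")'. – PaulMcG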
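To make the difference concrete, here is a minimal sketch comparing the over-broad rule from the question with the narrower one suggested in the comment. The names greedy, bounded and items are only illustrative and do not come from the thread.

```python
import pyparsing as pp

line = "1243; 1233; 1235;"

# Over-broad: space and ';' are both legal characters, so one token eats the line.
greedy = pp.Word(pp.printables + " ")

# Narrower: any printable except ';', so a token ends at the terminator.
bounded = pp.Word(pp.printables, excludeChars=";")

# Illustrative rule: one or more ';'-terminated items.
items = pp.OneOrMore(bounded + pp.Suppress(";"))

greedy_result = greedy.parseString(line).asList()
bounded_result = bounded.parseString(line).asList()
items_result = items.parseString(line).asList()

print(greedy_result)
print(bounded_result)
print(items_result)
```

With the greedy rule the terminators end up inside the token, which is what the question's output shows for '1243; 1233; 1235;'.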


Unfortunately, the format is correct. A real example: https://github.com/mansam/exploring-rhapsody/blob/master/LightSwitch/LightSwitch.rpy – delijati

Answer

2

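Wow, that is one messy model format! I think this will get you close. I started by trying to characterize what a valid value expression could be. I saw that each grouping can contain ';'-terminated attribute definitions or '{}'-enclosed nested objects. Each object starts with a leading identifier that gives the object type.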
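As a rough sketch of that structure (much simpler than the full grammar, with attribute values reduced to a single alphanumeric word), the recursion can be expressed with a Forward that is filled in using '<<='. Note that the question's code assigns to expr a second time with a plain '=', which rebinds the name and leaves the Forward referenced inside the grammar empty. The names obj, attr and sample are illustrative.

```python
import pyparsing as pp

LBRACE, RBRACE, SEMI, EQ, DASH = map(pp.Suppress, "{};=-")
ident = pp.Word(pp.alphas + "_", pp.alphanums + "_")

# Declare the recursive rule first so it can be referenced before it is defined.
obj = pp.Forward()

# Simplified attribute: "- name = word;" (real values are richer than this).
attr = pp.Group(DASH + ident + EQ + pp.Word(pp.alphanums) + SEMI)

# "<<=" fills in the existing Forward instead of rebinding the name.
obj <<= pp.Group(LBRACE + ident + pp.Group(pp.ZeroOrMore(attr | obj)) + RBRACE)

sample = "{ foo - key = bla; { boing - key = bla; } }"
nested = obj.parseString(sample).asList()
print(nested)
```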

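The difficult part was the very general token that I named 'value_word', which is just about any group of characters, as long as it does not start with '-', '{' or '}'. The negative lookaheads in the definition of 'value_word' take care of this. I think a key point here is that I was able to *not* include ' ' as a valid character in 'value_word', and instead let pyparsing do its default whitespace skipping, so that one or more 'value_word's can make up an 'attr_value'.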
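A small sketch of how the lookaheads and the default whitespace skipping interact, using the same 'value_word' definition in isolation. The words rule and the sample strings are only for illustration.

```python
import pyparsing as pp

DASH, LBRACE, RBRACE, SEMI = map(pp.Suppress, "-{};")

# A word may contain '-', '{' or '}', but must not start with one of them.
value_word = ~DASH + ~LBRACE + ~RBRACE + pp.Word(pp.printables, excludeChars=";")

# Whitespace is skipped between words, so a value can span several of them.
words = pp.OneOrMore(value_word) + SEMI
value_tokens = words.parseString("gibml c0d8-4535-898f-968362779e07;").asList()

# The lookaheads end the run as soon as the next attribute or a brace begins.
stops_at_dash = pp.OneOrMore(value_word).parseString("bla - key").asList()
stops_at_brace = pp.OneOrMore(value_word).parseString("bla }").asList()

print(value_tokens)
print(stops_at_dash)
print(stops_at_brace)
```

One consequence of this design is that a value beginning with '-', such as a negative number, is rejected by 'value_word' and would need its own rule.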

The final kicker (not in your test case, but in the example you linked to) was this line for an attribute "assignment":

  - m_pParent = ; 

So attr_value also has to allow an empty value.
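A reduced sketch of that idea: wrapping the delimited list in Optional lets the list be absent while the terminating ';' is still required. Here word is a simplified stand-in for the richer 'value_atom' of the full grammar.

```python
import pyparsing as pp

SEMI, EQ, DASH = map(pp.Suppress, ";=-")
ident = pp.Word(pp.alphas + "_", pp.alphanums + "_")

# Simplified stand-in for value_atom: any printable run without ';'.
word = pp.Word(pp.printables, excludeChars=";")

# Optional(...) lets the list be absent, so "- name = ;" still matches.
attr_value = pp.Optional(pp.delimitedList(word, ";")) + SEMI
attr_defn = pp.Group(DASH + ident("name") + EQ + pp.Group(attr_value)("value"))

empty = attr_defn.parseString("- m_pParent = ;").asList()
filled = attr_defn.parseString("- value = 1243; 1233; 1235;").asList()

print(empty)
print(filled)
```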

from pyparsing import * 

# Punctuation only structures the input, so suppress it from the results.
LBRACE,RBRACE,SEMI,EQ,DASH = map(Suppress,"{};=-") 

ident = Word(alphas + '_', alphanums+'_').setName("ident") 
# 'GUID' followed by five hyphen-separated hex groups, combined into one token.
guid = Group('GUID' + Combine(Word(hexnums)+('-'+Word(hexnums))*4)) 
qs = QuotedString('"', escChar="\\", multiline=True) 
character_literal = Combine("'" + oneOf(list(printables+' ')) + "'") 
# Any run of printables except ';' that does not start with '-', '{' or '}'.
value_word = ~DASH + ~LBRACE + ~RBRACE + Word(printables, excludeChars=';').setName("value_word") 

value_atom = guid | qs | character_literal | value_word 

# Forward declaration, because objects can nest inside attribute values.
object_ = Forward() 

# Either nested objects, or a possibly empty ';'-separated list ending in ';'.
attr_value = OneOrMore(object_) | Optional(delimitedList(Group(value_atom+OneOrMore(value_atom))|value_atom, ';')) + SEMI 
attr_value.setName("attr_value") 
attr_defn = Group(DASH + ident("name") + EQ + Group(attr_value)("value")) 
attr_defn.setName("attr_defn") 

object_ <<= Group(
    LBRACE + ident("type") + 
    Group(ZeroOrMore(attr_defn | object_))("attributes") + 
    RBRACE 
    ) 

# TEST is the sample string from the question.
object_.parseString(TEST).pprint() 

For your test string, it gives:

[['foo', 
    [['key', ['bla']], 
    ['value', ['1243', '1233', '1235']], 
    ['_hans', ['hammer\n time']], 
    ['HaMer', ['765', '786', '890']], 
    ['value', ['\n #pragma LINK_INFO DERIVATIVE "mc9s12xs256"\n  ']], 
    ['_mText', ['12.11.2015::13:20:0']], 
    ['value', ['war', 'fist']], 
    ['_obacht', ['fish,car,button']], 
    ['_id', [['gibml', 'c0d8-4535-898f-968362779e07']]], 
    ['bam', [['boing', [['key', ['bla']]]], ['boing', [['key', ['bla']]]]]]]]] 

I added results names that may help in processing these structures. Using object_.parseString(TEST).dump() gives this output:

[['foo', [['key', ['bla']], ['value', ['1243', '1233', '1235']], ['_hans', ['hammer\n time']], ... 
[0]: 
    ['foo', [['key', ['bla']], ['value', ['1243', '1233', '1235']], ['_hans', ['hammer\n time']], ... 
    - attributes: [['key', ['bla']], ['value', ['1243', '1233', '1235']], ['_hans', ['hammer... 
    [0]: 
     ['key', ['bla']] 
     - name: key 
     - value: ['bla'] 
    [1]: 
     ['value', ['1243', '1233', '1235']] 
     - name: value 
     - value: ['1243', '1233', '1235'] 
    [2]: 
     ['_hans', ['hammer\n time']] 
     - name: _hans 
     - value: ['hammer\n time'] 
    [3]: 
     ['HaMer', ['765', '786', '890']] 
     - name: HaMer 
     - value: ['765', '786', '890'] 
    [4]: 
     ['value', ['\n #pragma LINK_INFO DERIVATIVE "mc9s12xs256"\n  ']] 
     - name: value 
     - value: ['\n #pragma LINK_INFO DERIVATIVE "mc9s12xs256"\n  '] 
    [5]: 
     ['_mText', ['12.11.2015::13:20:0']] 
     - name: _mText 
     - value: ['12.11.2015::13:20:0'] 
    [6]: 
     ['value', ['war', 'fist']] 
     - name: value 
     - value: ['war', 'fist'] 
    [7]: 
     ['_obacht', ['fish,car,button']] 
     - name: _obacht 
     - value: ['fish,car,button'] 
    [8]: 
     ['_id', [['gibml', 'c0d8-4535-898f-968362779e07']]] 
     - name: _id 
     - value: [['gibml', 'c0d8-4535-898f-968362779e07']] 
     [0]: 
      ['gibml', 'c0d8-4535-898f-968362779e07'] 
    [9]: 
     ['bam', [['boing', [['key', ['bla']]]], ['boing', [['key', ['bla']]]]]] 
     - name: bam 
     - value: [['boing', [['key', ['bla']]]], ['boing', [['key', ['bla']]]]] 
     [0]: 
      ['boing', [['key', ['bla']]]] 
      - attributes: [['key', ['bla']]] 
      [0]: 
       ['key', ['bla']] 
       - name: key 
       - value: ['bla'] 
      - type: boing 
     [1]: 
      ['boing', [['key', ['bla']]]] 
      - attributes: [['key', ['bla']]] 
      [0]: 
       ['key', ['bla']] 
       - name: key 
       - value: ['bla'] 
      - type: boing 
    - type: foo 
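The results names shown in the dump can also be used directly when walking the parsed structure. The following is a sketch that uses a trimmed copy of the grammar (GUID and character literals are left out, and multi-word values are not grouped) with a shortened sample string; root and summary are illustrative names.

```python
from pyparsing import (Forward, Group, OneOrMore, Optional, QuotedString,
                       Suppress, Word, ZeroOrMore, alphanums, alphas,
                       delimitedList, printables)

LBRACE, RBRACE, SEMI, EQ, DASH = map(Suppress, "{};=-")
ident = Word(alphas + "_", alphanums + "_")
qs = QuotedString('"', escChar="\\", multiline=True)
value_word = ~DASH + ~LBRACE + ~RBRACE + Word(printables, excludeChars=";")
value_atom = qs | value_word

object_ = Forward()
attr_value = OneOrMore(object_) | Optional(delimitedList(value_atom, ";")) + SEMI
attr_defn = Group(DASH + ident("name") + EQ + Group(attr_value)("value"))
object_ <<= Group(LBRACE + ident("type")
                  + Group(ZeroOrMore(attr_defn | object_))("attributes") + RBRACE)

sample = '{ foo - key = bla; - value = "war"; "fist"; - bam = { boing - key = bla; } }'
root = object_.parseString(sample)[0]

# Look fields up by results name instead of by position.
root_type = root["type"]
summary = [(attr["name"], attr["value"].asList()) for attr in root["attributes"]]

print(root_type)
for name, value in summary:
    print(name, value)
```

Collecting (name, value) pairs rather than building a dict keeps repeated attribute names, such as the several 'value' entries in the test string.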

It also successfully parses the linked example, once the leading version line is removed.


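Nice. I added handling for negative numbers, dates and the leading header, but otherwise it works well. I did not know about the NotAny ('~') operator, but in summary it is all 'delimitedList' or 'object'. I tried a bigger model (700 kB) and it took 25 s to parse. I forked your repo and am playing with Cython. So far I have it down to 18 s. I will post my results to this repo: https://github.com/delijati/pyparsing – delijati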


Before forking or reaching for Cython, try enabling packrat parsing. It does not always help, but for some recursive grammars the speedup can be 10-100X. – PaulMcG
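In pyparsing, packrat parsing is switched on through the static method ParserElement.enablePackrat(), usually right after the import. A minimal sketch with a toy recursive grammar (not the Rhapsody grammar itself):

```python
import pyparsing as pp

# Turn on memoization of intermediate parse results for all parsers.
pp.ParserElement.enablePackrat()

LBRACE, RBRACE = map(pp.Suppress, "{}")

# Toy recursive grammar: a named object that may contain further objects.
obj = pp.Forward()
obj <<= pp.Group(LBRACE + pp.Word(pp.alphas) + pp.ZeroOrMore(obj) + RBRACE)

parsed = obj.parseString("{ a { b { c } } { d } }").asList()
print(parsed)
```

Whether this pays off depends on the grammar, so it is worth timing a real model file with and without it.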


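I tried 'pp.enablePackrat()', but memory consumption went through the roof, up to 4x. I have read that calling 'resetCache' helps: http://stackoverflow.com/questions/26591485/incremental-but-complete-parsing-with-pyparsing. Maybe replacing the cache with an 'lru_cache', which has a 'maxsize', would also work. – delijati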
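Later pyparsing releases accept a cache_size_limit argument to ParserElement.enablePackrat(), which bounds the memo table in a similar spirit to the 'maxsize' idea above. A sketch, assuming such a release is installed; the limit of 64 and the sample inputs are arbitrary choices for illustration:

```python
import pyparsing as pp

# Bound the packrat cache instead of letting it grow; 64 is an arbitrary value.
pp.ParserElement.enablePackrat(cache_size_limit=64)

LBRACE, RBRACE = map(pp.Suppress, "{}")
obj = pp.Forward()
obj <<= pp.Group(LBRACE + pp.Word(pp.alphas) + pp.ZeroOrMore(obj) + RBRACE)

# Stand-ins for separate model files.
documents = ["{ a { b } }", "{ c { d } { e } }"]

results = []
for text in documents:
    results.append(obj.parseString(text).asList())
    # Drop memoized entries once a parse is finished.
    pp.ParserElement.resetCache()

print(results)
```

A smaller limit trades some speed for memory, so the right value would have to be found by measuring on a real model.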