2012-11-02

pyparsing to query a database of chemical elements

I would like to parse queries against a database of chemical elements.

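The database is stored in an XML file. Parsing that file produces a nested dictionary, which is stored in a singleton object that inherits from collections.OrderedDict. Asking for an element gives me an ordered dictionary of its properties (i.e. ELEMENTS['C'] --> {'name': 'carbon', 'neutron': 0, 'proton': 6, ...}). Conversely, asking for a property gives me an ordered dictionary of its values for all the elements (i.e. ELEMENTS['proton'] --> {'H': 1, 'He': 2, ...}).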
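One way to get both kinds of lookup from a single object is the `__missing__` hook of dict subclasses. The sketch below is only an illustration: the class name `ElementsDB` and the two hand-written entries are assumptions, since the real singleton is built from the XML file and its implementation is not shown in the post.

```python
from collections import OrderedDict


class ElementsDB(OrderedDict):
    """Hypothetical stand-in for the ELEMENTS singleton (illustration only)."""

    def __missing__(self, key):
        # Called when `key` is not an element symbol: treat it as a property
        # name and collect that property's value for every element.
        column = OrderedDict(
            (symbol, props[key]) for symbol, props in self.items() if key in props
        )
        if not column:
            raise KeyError(key)
        return column

    @property
    def properties(self):
        # Names of all registered properties, in first-seen order.
        names = OrderedDict()
        for props in self.values():
            for name in props:
                names[name] = None
        return list(names)


# Illustrative values only, not taken from the real database.
ELEMENTS = ElementsDB()
ELEMENTS['H'] = OrderedDict([('name', 'hydrogen'), ('proton', 1)])
ELEMENTS['He'] = OrderedDict([('name', 'helium'), ('proton', 2)])

print(ELEMENTS['He'])      # properties of one element
print(ELEMENTS['proton'])  # one property across all elements
```

Looking up an element symbol goes through the normal dictionary path; only unknown keys fall through to `__missing__`, which builds the per-property view on demand.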

A typical query could be:

mass > 10 or (nucleon < 20 and atomic_radius < 5) 

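where each "subquery" (e.g. mass > 10) returns the set of elements that match it.

The query is then converted internally into a string, which is evaluated to produce the set of indexes of the elements that match it. In this context, and/or are not boolean operators but set operators acting on Python sets.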
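To make the set semantics concrete, here is a small sketch with made-up sub-query results. The element symbols are placeholders, not real query output; the point is only that "and" corresponds to set intersection and "or" to set union.

```python
# Hypothetical results of the three sub-queries; each is a set of element symbols.
mass_gt_10 = {'C', 'N', 'O'}
nucleon_lt_20 = {'H', 'He', 'C'}
radius_lt_5 = {'H', 'C', 'O'}

# mass > 10 or (nucleon < 20 and atomic_radius < 5)
# "and" maps to set intersection (&), "or" maps to set union (|).
result = mass_gt_10 | (nucleon_lt_20 & radius_lt_5)
print(sorted(result))
```

Note that the Python keywords `and` and `or`, applied to two sets, return one of the operands unchanged rather than combining them, so the `&` and `|` operators (or `intersection` and `union`) are needed here.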

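I recently posted a question about building such a query. Thanks to the useful answers I received, I think I have more or less got it working (in a good way, I hope!), but I still have some questions related to pyparsing.

Here is my code: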

import numpy 

from pyparsing import * 

# This imports a singleton object storing the database dictionary as 
# described earlier 
from ElementsDatabase import ELEMENTS 

and_operator = oneOf(['and','&'], caseless=True) 
or_operator = oneOf(['or' ,'|'], caseless=True) 

# ELEMENTS.properties is a property getter that returns the list of 
# registered properties in the database 
props = oneOf(ELEMENTS.properties, caseless=True) 

# A property keyword can be quoted or not. 
props = Suppress('"') + props + Suppress('"') | props 
# When parsed, it must be replaced by the following expression, which 
# will be evaluated later. 
props.setParseAction(lambda t : "numpy.array(ELEMENTS['%s'].values())" % t[0].lower()) 

quote = QuotedString('"') 
integer = Regex(r'[+-]?\d+').setParseAction(lambda t:int(t[0])) 
float_ = Regex(r'[+-]?(\d+(\.\d*)?)?([eE][+-]?\d+)?').setParseAction(lambda t:float(t[0])) 

comparison_operator = oneOf(['==','!=','>','>=','<', '<=']) 
comparison_expr = props + comparison_operator + (quote | float_ | integer) 
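# Build "set(numpy.where(<array><op><value>)[0])": the comparison goes inside where(), 
# and [0] selects the array of matching indexes 
comparison_expr.setParseAction(lambda t : "set(numpy.where(%s%s%s)[0])" % tuple(t)) 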

grammar = Combine(operatorPrecedence(comparison_expr, [(and_operator, 2, opAssoc.LEFT), (or_operator, 2, opAssoc.LEFT)])) 

# A test query 
res = grammar.parseString('"mass  " > 30 or (nucleon == 1)',parseAll=True) 

print eval(' '.join(res._asStringList())) 

My questions are the following:

1 Using 'transformString' instead of 'parseString' never triggers any 
    exception, even when the string to be parsed does not match the grammar. 
    However, that is exactly the functionality I need. Is there a way to do so? 

2 I would like to reintroduce white spaces between my tokens so 
that my eval does not fail. The only way I found to do so is the one 
implemented above. Do you see a better way using pyparsing? 

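Sorry for the long post, but I wanted to describe the context in more detail. By the way, if you find this approach bad, do not hesitate to tell me!

Thank you very much for your help.

Eric

Answers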


Never mind my concern, I found a workaround. I used the SimpleBool.py example shipped with pyparsing (thanks for the hint, Paul).

Basically, I used the following approach:

1 For each subquery (e.g. mass > 10), using the setParseAction method, 
I attached a function that returns the set of elements that matched 
the subquery. 

2 Then, I attached the following functions for each logical operator (and, 
or and not): 

def not_operator(token): 

    _, s = token[0] 

    # ELEMENTS is the singleton described in my original post 
    return set(ELEMENTS.keys()).difference(s) 

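def and_operator(token): 

    s1, _, s2 = token[0] 

    # Set intersection; the `and` keyword would just return one of the operands 
    return s1 & s2 

def or_operator(token): 

    s1, _, s2 = token[0] 

    # Set union; the `or` keyword would just return one of the operands 
    return s1 | s2 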

# Thanks to Paul for the hint. 
grammar = operatorPrecedence(comparison_expr, 
      [(not_token, 1,opAssoc.RIGHT,not_operator), 
      (and_token, 2, opAssoc.LEFT,and_operator), 
      (or_token, 2, opAssoc.LEFT,or_operator)]) 

Please note that these operators act upon Python sets rather than 
on booleans. 
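The function from step 1 (returning the set of elements that match a sub-query) is not shown in the answer. One possible sketch follows, assuming the parsed tokens arrive as `[property, operator, value]` and using a small hypothetical `ELEMENTS` dictionary in place of the real singleton; the names `COMPARATORS` and `match_subquery` are made up for this illustration.

```python
import operator

# Hypothetical stand-in for the ELEMENTS singleton: property -> {symbol: value}.
ELEMENTS = {
    'proton': {'H': 1, 'He': 2, 'Li': 3, 'C': 6},
}

# Map each comparison token to the matching function from the operator module.
COMPARATORS = {
    '==': operator.eq, '!=': operator.ne,
    '>': operator.gt, '>=': operator.ge,
    '<': operator.lt, '<=': operator.le,
}


def match_subquery(tokens):
    """Return the set of element symbols that satisfy one comparison."""
    prop, op, value = tokens
    compare = COMPARATORS[op]
    column = ELEMENTS[prop.lower()]
    return {symbol for symbol, v in column.items() if compare(v, value)}


print(sorted(match_subquery(['proton', '>', 1])))
```

A function of this shape could presumably be attached with `comparison_expr.setParseAction(...)`, provided the property token is left as a plain name; it computes the set directly instead of building a string for `eval`.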

And that does the job.

I hope this approach will help some of you.

Eric


Nice work! Glad you were able to work this out. This is very similar to the search engine query parser in "Getting Started with Pyparsing". – PaulMcG


Oh, by the way: if you try "cond_a and cond_b and cond_c", you will get '[[cond_a, 'and', cond_b, 'and', cond_c]]' passed to your parse action. The easiest way to handle this is to use slicing: change and_operator to 'return all(token[0][::2])' and or_operator to 'return any(token[0][::2])'. – PaulMcG
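Since the parse actions in the answer are meant to work on sets rather than booleans, the slicing idea from this comment can be combined with set operations. The following is a sketch only: it uses plain nested lists as a stand-in for the tokens pyparsing would pass in, and the sets are made up.

```python
def and_operator(token):
    # token[0] looks like [s1, 'and', s2, 'and', s3, ...];
    # every second item, starting at index 0, is an operand set.
    operands = token[0][::2]
    return set.intersection(*operands)


def or_operator(token):
    # Same layout with 'or' between the operand sets.
    operands = token[0][::2]
    return set.union(*operands)


# Stand-in for what pyparsing passes to the parse action (illustrative sets).
chained_and = [[{'H', 'C', 'O'}, 'and', {'C', 'O'}, 'and', {'O', 'N'}]]
chained_or = [[{'H'}, 'or', {'C'}, 'or', {'O'}]]

print(and_operator(chained_and))
print(sorted(or_operator(chained_or)))
```

Taking every second item skips the operator tokens, so the same function handles two operands or any longer chain.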