2010-01-02 122 views
0

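Parsing an XML file when tags are missing

I am trying to parse an XML file. Text inside the tags is parsed successfully (or so it seems), but I also want to output the text that is not enclosed in any of those tags, and the program below simply ignores it.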

from xml.etree.ElementTree import XMLParser  # XMLTreeBuilder is an older alias of XMLParser


class HtmlLatex:                      # The target object of the parser
    out = ''
    var = ''

    def start(self, tag, attrib):     # Called for each opening tag.
        pass

    def end(self, tag):               # Called for each closing tag.
        if tag == 'i':
            self.out += self.var
        elif tag == 'sub':
            self.out += '_{' + self.var + '}'
        elif tag == 'sup':
            self.out += '^{' + self.var + '}'
        else:
            self.out += self.var

    def data(self, data):             # Called for each chunk of text.
        self.var = data               # Keeps only the most recent chunk.

    def close(self):                  # Called when all data has been parsed.
        print(self.out)


if __name__ == '__main__':
    target = HtmlLatex()
    parser = XMLParser(target=target)

    with open('input.txt') as f1:
        text = f1.read()

    print(text)

    parser.feed(text)
    parser.close()

Part of the input I want to parse: <p><i>p</i><sub>0</sub> = (<i>m</i><sup>3</sup>+(2<i>l</i><sub>2</sub>+<i>l</i><sub>1</sub>) <i>m</i><sup>2</sup>+(<i>l</i><sub>2</sub><sup>2</sup>+2<i>l</i><sub>1</sub> <i>l</i><sub>2</sub>+<i>l</i><sub>1</sub><sup>2</sup>) <i>m</i>) /(<i>m</i><sup>3</sup>+(3<i>l</i><sub>2</sub>+2<i>l</i><sub>1</sub>)) }.</p>

+1

This is like no XML I have ever seen. Are you sure you don't want an _html_ parser? – James 2010-01-02 15:08:25

+0

It was produced here: http://wims.unice.fr/wims/en_tool~linear~linsolver.en.html When you get the solution and look at the page source, you will see something similar. – 2010-01-02 15:28:46

+1

Just edited out the LaTeX tag. ??? – 2010-01-02 17:03:53

Answers

2

Here is a pyparsing version. I hope the comments are explanatory enough.

src = ("""<p><i>p</i><sub>0</sub> = (<i>m</i><sup>3</sup>+(2<i>l</i><sub>2</sub>+<i>l</i><sub>1</sub>) """
       """<i>m</i><sup>2</sup>+(<i>l</i><sub>2</sub><sup>2</sup>+2<i>l</i><sub>1</sub> <i>l</i><sub>2</sub>+"""
       """<i>l</i><sub>1</sub><sup>2</sup>) <i>m</i>) /(<i>m</i><sup>3</sup>+(3<i>l</i><sub>2</sub>+"""
       """2<i>l</i><sub>1</sub>)) }.</p>""")

from pyparsing import makeHTMLTags, anyOpenTag, anyCloseTag, Suppress, replaceWith 

# set up tag matching for <sub> and <sup> tags 
SUB,endSUB = makeHTMLTags("sub") 
SUP,endSUP = makeHTMLTags("sup") 

# all other tags will be suppressed from the output 
ANY,endANY = map(Suppress,(anyOpenTag,anyCloseTag)) 

SUB.setParseAction(replaceWith("_{")) 
SUP.setParseAction(replaceWith("^{")) 
endSUB.setParseAction(replaceWith("}")) 
endSUP.setParseAction(replaceWith("}")) 

transformer = (SUB | endSUB | SUP | endSUP | ANY | endANY) 

# now use the transformer to apply these transforms to the input string 
print(transformer.transformString(src))

which gives:

p_{0} = (m^{3}+(2l_{2}+l_{1}) m^{2}+(l_{2}^{2}+2l_{1} l_{2}+l_{1}^{2}) m) /(m^{3}+(3l_{2}+2l_{1})) }. 
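
For comparison, the target-object approach from the question can also be made to keep the untagged text using only the standard library. The sketch below is not part of the original answer: the class name `HtmlToLatex`, the `OPEN`/`CLOSE` tables, and the shortened sample string are illustrative assumptions. The idea is to append every chunk of text as soon as `data()` receives it, and to emit the LaTeX delimiters from `start()` and `end()`, so nothing gets overwritten between tags.

```python
from xml.etree.ElementTree import XMLParser


class HtmlToLatex:
    """Parser target that emits text as it arrives instead of buffering it."""

    # LaTeX emitted when a tag opens or closes; unlisted tags emit nothing.
    OPEN = {'sub': '_{', 'sup': '^{'}
    CLOSE = {'sub': '}', 'sup': '}'}

    def __init__(self):
        self.parts = []

    def start(self, tag, attrib):      # Called for each opening tag.
        self.parts.append(self.OPEN.get(tag, ''))

    def end(self, tag):                # Called for each closing tag.
        self.parts.append(self.CLOSE.get(tag, ''))

    def data(self, data):              # Called for each chunk of text.
        self.parts.append(data)

    def close(self):                   # Its return value is passed back by parser.close().
        return ''.join(self.parts)


# Shortened sample input (an assumption, modelled on the question's markup).
src = '<p><i>p</i><sub>0</sub> = (<i>m</i><sup>3</sup>+2<i>l</i><sub>1</sub>)</p>'

parser = XMLParser(target=HtmlToLatex())
parser.feed(src)
result = parser.close()
print(result)
```

This still requires the input to be well-formed XML with a single root element; markup that is only loosely HTML would need a more lenient parser.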
3

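Take a look at BeautifulSoup, a Python library for parsing, navigating, and manipulating HTML and XML. It has a convenient interface and may well solve your problem...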
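
A minimal sketch of how that might look, assuming the `bs4` package (Beautiful Soup 4) is installed and using a shortened sample string in place of the real input. The variable names are illustrative and the answer itself does not prescribe this approach. Each `<sub>` and `<sup>` element is replaced by its LaTeX form, and the remaining tags are then dropped by extracting the text.

```python
from bs4 import BeautifulSoup

# Shortened sample input (an assumption, modelled on the question's markup).
src = '<p><i>p</i><sub>0</sub> = (<i>m</i><sup>3</sup>+2<i>l</i><sub>1</sub>)</p>'

soup = BeautifulSoup(src, 'html.parser')

# Swap each <sub>/<sup> element for a plain string in LaTeX notation.
for name, prefix in (('sub', '_{'), ('sup', '^{')):
    for tag in soup.find_all(name):
        tag.replace_with(prefix + tag.get_text() + '}')

# get_text() concatenates all remaining text, discarding <p> and <i>.
latex = soup.get_text()
print(latex)
```

This sketch assumes `<sub>` and `<sup>` are never nested inside one another, which holds for the sample shown in the question.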

+0

Thanks for the suggestion. I'll take a look at it. – 2010-01-02 16:07:50