2015-11-20 114 views
3

Python XPath SyntaxError: invalid predicate

I am trying to parse an XML document like this:

<document> 
    <pages> 

    <page> 
     <paragraph>XBV</paragraph> 

     <paragraph>GHF</paragraph> 
    </page> 

    <page> 
     <paragraph>ash</paragraph> 

     <paragraph>lplp</paragraph> 
    </page> 

    </pages> 
</document> 

Here is my code:

import xml.etree.ElementTree as ET 

tree = ET.parse("../../xml/test.xml") 

root = tree.getroot() 

path="./pages/page/paragraph[text()='GHF']" 

print root.findall(path) 

But I get an error:

print root.findall(path) 
    File "X:\Anaconda2\lib\xml\etree\ElementTree.py", line 390, in findall 
    return ElementPath.findall(self, path, namespaces) 
    File "X:\Anaconda2\lib\xml\etree\ElementPath.py", line 293, in findall 
    return list(iterfind(elem, path, namespaces)) 
    File "X:\Anaconda2\lib\xml\etree\ElementPath.py", line 263, in iterfind 
    selector.append(ops[token[0]](next, token)) 
    File "X:\Anaconda2\lib\xml\etree\ElementPath.py", line 224, in prepare_predicate 
    raise SyntaxError("invalid predicate") 
SyntaxError: invalid predicate 

What is wrong with my XPath?

Follow-up

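Thanks falsetru, your solution worked. I have a follow-up. Now I want to get all the paragraph elements that come before the paragraph with the text GHF. In this case I only need the XBV element, and I want to ignore ash and lplp. I guess one way to do this would be: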

result = []
for para in root.findall('./pages/page/'):
    t = para.text.encode("utf-8", "ignore")
    if t == "GHF":
        break
    else:
        result.append(para)

But is there a better way to do this?

Answers

9

ElementTree's XPath support is limited. Use another library such as lxml:

import lxml.etree 
root = lxml.etree.parse('test.xml') 

path="./pages/page/paragraph[text()='GHF']" 
print root.xpath(path) 
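
If installing lxml is not an option, a standard-library workaround may be enough for exact text matches. The sketch below is written for Python 3 and assumes Python 3.7 or later, where ElementTree accepts the `[.='text']` predicate; the `XML` string is just the sample document from the question, inlined so the snippet runs on its own.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, inlined for a self-contained run.
XML = """<document>
  <pages>
    <page>
      <paragraph>XBV</paragraph>
      <paragraph>GHF</paragraph>
    </page>
    <page>
      <paragraph>ash</paragraph>
      <paragraph>lplp</paragraph>
    </page>
  </pages>
</document>"""

root = ET.fromstring(XML)

# The text() predicate from the question is outside ElementTree's XPath subset.
try:
    root.findall("./pages/page/paragraph[text()='GHF']")
    text_predicate_ok = True
except SyntaxError:
    text_predicate_ok = False

# Assumption: Python 3.7+. Here "." compares the element's own text content.
matches = root.findall("./pages/page/paragraph[.='GHF']")

print(text_predicate_ok)
print([p.text for p in matches])
```

On older interpreters, such as the Python 2 installation shown in the traceback, the `[.='GHF']` form is not available, so lxml remains the simpler route there.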

Thanks man! Can I do something like text.contains("something") and text.notContains("something")? – AbtPst


@AbtPst, you can: `path = "./pages/page/paragraph[contains(text(), 'something')]"` / `path = "./pages/page/paragraph[not(contains(text(), 'something'))]"` – falsetru


No, you cannot do that with `find_all` (see http://stackoverflow.com/questions/2637760/how-do-i-match-contents-of-an-element-in-xpath-lxml), because `def prepare_predicate(next, token)` fails – SIslam
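
Since ElementTree rejects `contains()`, one possible standard-library substitute is to select the candidate elements with a plain path and filter them in Python. This is only a rough equivalent of the lxml expressions above, written for Python 3; the helper name `text_contains` and the search string "GH" are made up for illustration, and the sample XML is the one from the question.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, inlined for a self-contained run.
XML = """<document>
  <pages>
    <page>
      <paragraph>XBV</paragraph>
      <paragraph>GHF</paragraph>
    </page>
    <page>
      <paragraph>ash</paragraph>
      <paragraph>lplp</paragraph>
    </page>
  </pages>
</document>"""

root = ET.fromstring(XML)
paragraphs = root.findall("./pages/page/paragraph")


def text_contains(elem, needle):
    """Hypothetical helper: True if the element's leading text contains needle."""
    # elem.text is None for an element with no leading text, so fall back to "".
    return needle in (elem.text or "")


# Stand-in for paragraph[contains(text(), 'GH')]
containing = [p.text for p in paragraphs if text_contains(p, "GH")]

# Stand-in for paragraph[not(contains(text(), 'GH'))]
not_containing = [p.text for p in paragraphs if not text_contains(p, "GH")]

print(containing)
print(not_containing)
```

The filter is case-sensitive, like XPath's `contains()`.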

0

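As @falsetru mentioned, ElementTree does not support text() predicates, but it does support matching a child element by its text. So in this example it is possible to search for a page that has a paragraph with specific text, using the path ./pages/page[paragraph='GHF']. The problem here is that a page contains multiple paragraph tags, so you would still need to iterate to reach the specific paragraph. In my case, I needed to find the version of a dependency in a Maven pom.xml, where each dependency has exactly one version child, so the following worked: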

In [1]: import xml.etree.ElementTree as ET

In [2]: ns = {"pom": "http://maven.apache.org/POM/4.0.0"}

In [3]: ET.parse("pom.xml").findall(".//pom:dependencies/pom:dependency[pom:artifactId='some-artifact-with-hardcoded-version']/pom:version", ns)[0].text
Out[3]: '1.2.3'
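
Applied to the XML from the question, the same child-text predicate can also serve the follow-up (collecting the paragraphs that come before GHF). The following is a minimal Python 3 sketch, not a definitive solution: it assumes that only the page containing GHF is of interest, and it uses `itertools.takewhile` to stop at the first matching paragraph.

```python
import xml.etree.ElementTree as ET
from itertools import takewhile

# Sample document from the question, inlined for a self-contained run.
XML = """<document>
  <pages>
    <page>
      <paragraph>XBV</paragraph>
      <paragraph>GHF</paragraph>
    </page>
    <page>
      <paragraph>ash</paragraph>
      <paragraph>lplp</paragraph>
    </page>
  </pages>
</document>"""

root = ET.fromstring(XML)

# Child-text predicate: pages that have a <paragraph> child whose text is "GHF".
pages = root.findall("./pages/page[paragraph='GHF']")

before = []
for page in pages:
    # Keep paragraphs in document order until the first one whose text is "GHF".
    before.extend(takewhile(lambda p: p.text != "GHF", page.findall("paragraph")))

print([p.text for p in before])
```

Unlike the loop in the question, this never visits the second page, so ash and lplp are skipped without an explicit check.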