2013-12-11 57 views
1

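lxml: find all links with certain extensions

I am trying to use lxml to find all anchor links that point to images (.png, .bmp, .jpg) and executables (.exe). The accepted answer in a similar thread suggests doing something like this: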

png = tree.xpath("//div/ul/li//a[ends-with(@href, '.png')]") 
bmp = tree.xpath("//div/ul/li//a[ends-with(@href, '.bmp')]") 
jpg = tree.xpath("//div/ul/li//a[ends-with(@href, '.jpg')]") 
exe = tree.xpath("//div/ul/li//a[ends-with(@href, '.exe')]") 

However, I keep getting this error:

Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "lxml.etree.pyx", line 2095, in lxml.etree._ElementTree.xpath (src/lxml/lxml.etree.c:53597) 
    File "xpath.pxi", line 373, in lxml.etree.XPathDocumentEvaluator.__call__ (src/lxml/lxml.etree.c:134052) 
    File "xpath.pxi", line 241, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:132625) 
    File "xpath.pxi", line 226, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:132453) 
lxml.etree.XPathEvalError: Unregistered function 

I am running lxml 3.2.4, installed via pip.

Also, rather than defining the XPath four times, once per file extension, is there a way to specify all four extensions in a single XPath expression?

Answers

3

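ends-with is a function defined in XPath 2.0, XQuery 1.0 and XSLT 2.0, whereas lxml only supports XPath 1.0, XSLT 1.0 and the EXSLT extensions, so you cannot use this function. The documentation is here and here.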

You can use regular expressions in XPath instead. The following sample code returns the nodes whose href matches the regular expression:

regexpNS = 'http://exslt.org/regular-expressions'
# The escaped dot anchors the match to a real file extension at the end of href
tree.xpath(r"//a[re:test(@href, '\.(png|bmp|jpg|exe)$')]", namespaces={'re': regexpNS})

Here is a similar question: Python, XPath: Find all links to images. See also regular-expressions-in-xpath.
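If you would rather stay within plain XPath 1.0 and avoid the EXSLT namespace, ends-with can be emulated with substring() and string-length(), and several such tests can be joined with `or` in one predicate. The sketch below only builds the expression string. The helper names `ends_with` and `links_with_extensions_xpath` are made up for illustration, and the code assumes the extensions contain no single quotes.

```python
def ends_with(attr, suffix):
    """Return an XPath 1.0 test that is true when @attr ends with suffix."""
    # substring() is 1-based, so the suffix starts at length - len(suffix) + 1
    offset = len(suffix) - 1
    return f"substring(@{attr}, string-length(@{attr}) - {offset}) = '{suffix}'"


def links_with_extensions_xpath(extensions, base="//a"):
    """Build one XPath expression matching hrefs that end with any extension."""
    predicate = " or ".join(ends_with("href", ext) for ext in extensions)
    return f"{base}[{predicate}]"


expr = links_with_extensions_xpath([".png", ".bmp", ".jpg", ".exe"])
print(expr)
```

The resulting string can be passed to `tree.xpath(expr)`. Passing `base="//div/ul/li//a"` restricts the search to the list items from the question.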

0

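I think the problem is that the underlying library does not recognize the ends-with function. The documentation discusses working with links. I think a better solution would be something like this: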

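from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

# Assumes `tree` is an lxml.html element (e.g. from lxml.html.fromstring);
# plain lxml.etree trees have no iterlinks() or make_links_absolute()
wanted = ('.png', '.bmp', '.jpg', '.exe')
tree.make_links_absolute('http://example.com/')
links = []
for element, attribute, link, pos in tree.iterlinks():
    path = urlparse(link).path  # path only, without query string or fragment
    if path.lower().endswith(wanted):
        # there are other ways you could filter the links here
        links.append(link)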
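The filtering step above is independent of lxml, so it can be sketched and tested on its own. The helper name `has_wanted_extension` and the sample URLs are hypothetical. The point is that checking the parsed path, rather than the raw URL, ignores query strings and fragments.

```python
from urllib.parse import urlparse

# The four extensions named in the question
WANTED_EXTENSIONS = (".png", ".bmp", ".jpg", ".exe")


def has_wanted_extension(url, extensions=WANTED_EXTENSIONS):
    """Return True when the path component of url ends with a wanted extension."""
    path = urlparse(url).path
    # str.endswith accepts a tuple, so one call checks every extension
    return path.lower().endswith(tuple(extensions))


sample_urls = [
    "http://example.com/images/logo.PNG?size=large",
    "http://example.com/downloads/setup.exe#mirror",
    "http://example.com/page.html?file=photo.jpg",
]
matches = [url for url in sample_urls if has_wanted_extension(url)]
print(matches)
```

Only the first two sample URLs are kept. The third mentions .jpg in its query string, but its path ends in .html.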
+0

If I know where the links live (i.e. inside an unordered list, '//div/ul/li//a'), is there a way to make 'iterlinks()' search only the unordered list rather than the whole DOM? – user21398

+0

Hmm... when I call 'tree.iterlinks()', I get this error: 'AttributeError: 'lxml.etree._ElementTree' object has no attribute 'iterlinks''. – user21398
