2016-05-16 55 views
1

lxml.etree.XPathEvalError: Invalid expression

I am getting a Python error that I cannot make sense of. I have reduced my code to the bare minimum:

response = requests.get('http://pycoders.com/archive') 
tree = html.fromstring(response.text) 
r = tree.xpath('//divass="campaign"]/a/@href') 
print(r) 

and I still get the error:

Traceback (most recent call last): 
File "ultimate-1.py", line 17, in <module> 
r = tree.xpath('//divass="campaign"]/a/@href') 
File "lxml.etree.pyx", line 1509, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:50702) 
File "xpath.pxi", line 318, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:145954) 
File "xpath.pxi", line 238, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:144962) 
File "xpath.pxi", line 224, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:144817) 
lxml.etree.XPathEvalError: Invalid expression 

Does anyone have an idea where the problem comes from? Could it be a dependency issue? Thanks.

Answers

1

The expression '//divass="campaign"]/a/@href' is not syntactically valid and does not make much sense. You most likely meant to test the class attribute instead:

//div[@class="campaign"]/a/@href 
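To see the difference without hitting the network, here is a minimal sketch that runs both expressions against a small inline page. The markup in SAMPLE_PAGE and the example.com URLs are made up for illustration and are not taken from the real site; only html.fromstring, xpath and etree.XPathEvalError come from lxml.

```python
from lxml import etree, html

# Hypothetical stand-in for the real page, so the example runs offline.
SAMPLE_PAGE = """
<html><body>
  <div class="campaign"><a href="http://example.com/issue-1">Issue 1</a></div>
  <div class="campaign"><a href="http://example.com/issue-2">Issue 2</a></div>
  <div class="other"><a href="http://example.com/about">About</a></div>
</body></html>
"""

tree = html.fromstring(SAMPLE_PAGE)

# The expression from the question has no opening "[" and no "@class" test,
# so the XPath parser rejects it before any matching happens.
try:
    tree.xpath('//divass="campaign"]/a/@href')
    broken_error = None
except etree.XPathEvalError as exc:
    broken_error = exc
    print("rejected:", exc)

# The corrected expression filters <div> elements on their class attribute
# and returns the href values of their direct <a> children.
campaign_links = tree.xpath('//div[@class="campaign"]/a/@href')
print(campaign_links)
```

The predicate `[@class="campaign"]` compares the whole attribute value, so a div with several classes would need something like `contains(@class, "campaign")` instead.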

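This gets rid of the invalid-expression error, but the expression will return nothing, because the response that requests receives does not contain the campaign data. You need to mimic what the browser does: make an additional request to fetch the JavaScript file that contains the campaigns.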

Here is what worked for me:

import ast 
import re 

import requests 
from lxml import html 

with requests.Session() as session: 
    # extract script url 
    response = session.get('http://pycoders.com/archive') 
    tree = html.fromstring(response.text) 
    script_url = tree.xpath("//script[contains(@src, 'generate-js')]/@src")[0] 

    # get the script 
    response = session.get(script_url) 
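    # decode first: on Python 3, response.content is bytes and a str pattern cannot match it
    data = ast.literal_eval(re.match(r'document\.write\((.*?)\);$', response.content.decode('utf-8')).group(1))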

    # extract the desired data 
    tree = html.fromstring(data) 
    campaigns = [item.attrib["href"].replace("\\", "") for item in tree.xpath('//div[@class="campaign"]/a')] 
    print(campaigns) 

Prints:

['http://us4.campaign-archive2.com/?u=9735795484d2e4c204da82a29&id=3384ab2140', 
... 
'http://us4.campaign-archive2.com/?u=9735795484d2e4c204da82a29&id=8b91cb0481' 
] 

Thanks! I had to use response.content.decode('utf-8') to make it work. – Bastien
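The decode step mentioned in the comment above matters on Python 3, where response.content is a bytes object and re refuses to apply a str pattern to bytes. The sketch below reproduces this offline; RAW_SCRIPT is an invented payload shaped like a document.write(...) call, and extract_written_html is a hypothetical helper name, not part of requests or lxml.

```python
import ast
import re

# Same idea as the pattern in the answer, with the dot escaped.
PATTERN = r'document\.write\((.*?)\);$'

# Invented payload standing in for the bytes that response.content would hold.
RAW_SCRIPT = rb'document.write("<div class=\"campaign\"><a href=\"http://example.com/1\">One</a></div>");'


def extract_written_html(script_text):
    """Return the HTML string passed to document.write(...) in script_text."""
    match = re.match(PATTERN, script_text)
    if match is None:
        raise ValueError("no document.write(...) call found")
    # The captured group is a quoted string literal; literal_eval unquotes it.
    return ast.literal_eval(match.group(1))


# A str pattern applied to bytes raises TypeError on Python 3.
try:
    re.match(PATTERN, RAW_SCRIPT)
    bytes_error = None
except TypeError as exc:
    bytes_error = exc
    print("bytes input rejected:", exc)

# Decoding to str first makes the same pattern usable.
written_html = extract_written_html(RAW_SCRIPT.decode('utf-8'))
print(written_html)
```

With requests, response.text would also give a str, but it relies on the encoding that requests detects, so decoding explicitly is the more predictable choice when the encoding is known.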

0

You made a mistake in the XPath. If you want to get all the hrefs, your XPath should look like this:

hrefs = tree.xpath('//div[@class="campaign"]/a') 
for href in hrefs: 
    print(href.get('href')) 

Or as a one-liner:

hrefs = [item.get('href') for item in tree.xpath('//div[@class="campaign"]/a')]
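Selecting the elements and calling get('href') is not quite the same as ending the XPath with /@href: the two differ when an anchor has no href attribute. The following sketch shows this on an invented snippet (SAMPLE_PAGE and its URLs are assumptions for illustration only).

```python
from lxml import html

# Hypothetical markup; the second anchor deliberately has no href attribute.
SAMPLE_PAGE = """
<html><body>
  <div class="campaign"><a href="http://example.com/1">One</a></div>
  <div class="campaign"><a name="no-link">Two</a></div>
  <div class="campaign"><a href="http://example.com/2">Three</a></div>
</body></html>
"""

tree = html.fromstring(SAMPLE_PAGE)

# Element selection followed by .get(): one entry per <a>, None when href is absent.
hrefs_via_get = [item.get('href') for item in tree.xpath('//div[@class="campaign"]/a')]

# Attribute selection in the XPath itself: anchors without href are skipped.
hrefs_via_xpath = tree.xpath('//div[@class="campaign"]/a/@href')

print(hrefs_via_get)
print(hrefs_via_xpath)
```

If every anchor is known to carry an href, both forms return the same values; otherwise the /@href form avoids having to filter out None entries afterwards.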