2014-10-02 22 views
5

Can I supply a URL to lxml.etree.parse on Python 3?

The documentation says I can:

lxml can parse from a local file, an HTTP URL or an FTP URL. It also auto-detects and reads gzip-compressed XML files (.gz).

(from the "Parsers" section of http://lxml.de/parsing.html)

but a quick experiment seems to suggest otherwise:

Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 10:45:13) [MSC v.1600 64 bit (AMD64)] on win32 
Type "help", "copyright", "credits" or "license" for more information. 
>>> from lxml import etree 
>>> parser = etree.HTMLParser() 
>>> from urllib.request import urlopen 
>>> with urlopen('https://pypi.python.org/simple') as f: 
... tree = etree.parse(f, parser) 
... 
>>> tree2 = etree.parse('https://pypi.python.org/simple', parser) 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "lxml.etree.pyx", line 3299, in lxml.etree.parse (src\lxml\lxml.etree.c:72655) 
    File "parser.pxi", line 1791, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:106263) 
    File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src\lxml\lxml.etree.c:106564) 
    File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src\lxml\lxml.etree.c:105561) 
    File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src\lxml\lxml.etree.c:100456) 
    File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:94543) 
    File "parser.pxi", line 690, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:96003) 
    File "parser.pxi", line 618, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:95015) 
OSError: Error reading file 'https://pypi.python.org/simple': failed to load external entity "https://pypi.python.org/simple" 
>>> 

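I can use the urlopen approach, but the documentation seems to imply that passing a URL is somehow better. Also, if the documentation is inaccurate, I'm a little worried about relying on lxml, especially if I start needing to do more complex things.

What is the correct way to parse HTML from a known URL with lxml? And where should I look to find this documented?

Update: I get the same error if I use an http URL rather than an https one.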

+1

It works for HTTP URLs, but not for HTTPS. – isedev 2014-10-02 14:39:34

+0

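No, http fails as well, with the same error. Sorry, I should have said so (although not supporting HTTPS makes the ability to use URLs somewhat insecure :-() – 2014-10-02 15:08:57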

+0

Try it with "www.google.com", for example; that works for me. – isedev 2014-10-02 15:14:18

Answers

8

The problem is that lxml does not support HTTPS URLs, and http://pypi.python.org/simple redirects to the HTTPS version.

So for any secure site, you need to read the URL yourself:

from lxml import etree 
from urllib.request import urlopen 

parser = etree.HTMLParser() 

with urlopen('https://pypi.python.org/simple') as f: 
    tree = etree.parse(f, parser) 
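
The same idea can be wrapped in a small helper. This is only a sketch, not part of lxml: `parse_html_from_url`, its `opener` parameter, `fake_opener` and the example URL are all hypothetical names introduced here so the snippet can be run without network access. In real use you would leave `opener` at its default, `urlopen`.

```python
import io
from urllib.request import urlopen

from lxml import etree


def parse_html_from_url(url, opener=urlopen):
    """Open the URL ourselves and hand the file-like response to lxml.

    `opener` is a hypothetical injection point (default: urllib's urlopen)
    so the function can be tried out without touching the network.
    """
    parser = etree.HTMLParser()
    with opener(url) as response:
        return etree.parse(response, parser)


def fake_opener(url):
    # Stand-in for urlopen: ignores the URL and returns canned HTML bytes.
    return io.BytesIO(b"<html><body><a href='lxml/'>lxml</a></body></html>")


# Offline demonstration; the URL is a placeholder and is never fetched.
tree = parse_html_from_url("https://example.invalid/simple", opener=fake_opener)
links = [a.get("href") for a in tree.iter("a")]
print(links)
```

Because lxml only ever sees an already-open file-like object, fetching the page (including HTTPS and any redirects) is left entirely to `urllib`.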