Error "failed to load external entity" when using Python lxml

When I try to parse an XML document that I retrieve from the web, parsing crashes with this error:

': failed to load external entity "<?xml version="1.0" encoding="UTF-8"?> 
<?xml-stylesheet type="text/xsl" href="GreenButtonDataStyleSheet.xslt"?>

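That is the second line of the XML that is downloaded. Is there a way to prevent the parser from trying to load the external entity, or another way to solve this? This is my code so far: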

import urllib2 
import lxml.etree as etree 

file = urllib2.urlopen("http://www.greenbuttondata.org/data/15MinLP_15Days.xml") 
data = file.read() 
file.close() 

tree = etree.parse(data)

2012-05-04 daveeloo

You are getting that error because the XML you are loading references an external resource:

<?xml-stylesheet type="text/xsl" href="GreenButtonDataStyleSheet.xslt"?>

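lxml does not know how to resolve GreenButtonDataStyleSheet.xslt. You and I can probably see that it would be available relative to your original URL, http://www.greenbuttondata.org/data/15MinLP_15Days.xml. The trick is to tell lxml how to go about loading it.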

The lxml documentation includes a section titled "Document loading and URL resolving", which contains all the information you need.
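As a rough sketch of what that section describes, a custom resolver can be registered on a parser so that lxml asks your code how to load an external reference. The pattern follows the lxml documentation, but the class name InMemoryDTDResolver, the file name MissingDTD.dtd and the entity myentity are made up for illustration. The example uses Python 3 and an external DTD rather than the stylesheet from the question.

```python
from io import BytesIO
from lxml import etree


class InMemoryDTDResolver(etree.Resolver):
    """Hypothetical resolver that answers every external lookup from memory."""

    def __init__(self):
        super().__init__()
        self.requested = []  # URLs that lxml asked this resolver to load

    def resolve(self, url, pubid, context):
        self.requested.append(url)
        # Return DTD content as a string instead of reading disk or network.
        dtd_text = '<!ENTITY myentity "[resolved text: %s]">' % url
        return self.resolve_string(dtd_text, context)


parser = etree.XMLParser(load_dtd=True)
resolver = InMemoryDTDResolver()
parser.resolvers.add(resolver)

# The DTD named here does not exist anywhere; the resolver supplies it.
xml = b'<!DOCTYPE doc SYSTEM "MissingDTD.dtd"><doc>&myentity;</doc>'
tree = etree.parse(BytesIO(xml), parser)
root = tree.getroot()

print(resolver.requested)
print(root.text)
```

The same hook is where a lookup relative to the original download URL could be implemented, for example by fetching the resource yourself and returning it with resolve_string().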

2012-05-05 00:38:49 larsks

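Do you know whether it is possible to turn off loading of all external resources? I looked through the documentation but could not find anything. – daveeloo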

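"*You are getting that error because the XML you are loading references an external resource*". No, that is not why you get the error. Please see my answer. – mzjn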

etree.parse(source) expects source to be one of

a file name/path
a file object
a file-like object
a URL using the HTTP or FTP protocol

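The problem is that you are supplying the XML content as a string, so lxml treats the content itself as a file name or URL and fails to open it.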
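To see why a string triggers this message, here is a small offline sketch in Python 3 (the question itself uses Python 2). The sample document and the helper name try_parse_as_path are illustrative only, and the exact wording of the exception may differ between lxml versions.

```python
from lxml import etree

# Stand-in for the downloaded document (not the real Green Button data).
xml_text = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<?xml-stylesheet type="text/xsl" href="GreenButtonDataStyleSheet.xslt"?>\n'
    '<feed xmlns="http://www.w3.org/2005/Atom"><title>demo</title></feed>'
)


def try_parse_as_path(text):
    """Pass XML content to etree.parse() and return the resulting error, if any."""
    try:
        etree.parse(text)  # a plain string is treated as a file name or URL
    except OSError as exc:  # the unreadable "file" is reported as an I/O error
        return exc
    return None


error = try_parse_as_path(xml_text)
print(type(error).__name__, error)
```

The error text quotes the start of the XML content because that content is what lxml tried to open as a path.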

You can also do without urllib2.urlopen(). Just use

tree = etree.parse("http://www.greenbuttondata.org/data/15MinLP_15Days.xml")

Demonstration (using lxml 2.3.4):

>>> from lxml import etree 
>>> tree = etree.parse("http://www.greenbuttondata.org/data/15MinLP_15Days.xml") 
>>> tree.getroot() 
<Element {http://www.w3.org/2005/Atom}feed at 0xedaa08> 
>>>

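In a competing answer, it is suggested that lxml fails because of the stylesheet referenced by the processing instruction in the document. But that is not the problem here. lxml does not try to load the stylesheet, and the XML document is parsed just fine if you do as described above.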
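One way to check this without network access is to parse a small document whose stylesheet reference points at a file that does not exist. This is a sketch in Python 3 with a made-up sample document; the processing instruction is kept as an ordinary node and nothing is fetched.

```python
from io import BytesIO
from lxml import etree

# Placeholder document; GreenButtonDataStyleSheet.xslt is not present locally.
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<?xml-stylesheet type="text/xsl" href="GreenButtonDataStyleSheet.xslt"?>\n'
    b'<feed xmlns="http://www.w3.org/2005/Atom"><title>demo</title></feed>'
)

# Parsing succeeds even though the stylesheet cannot be found.
tree = etree.parse(BytesIO(xml_bytes))
root = tree.getroot()

# The processing instruction sits just before the root element.
pi = root.getprevious()

print(root.tag)
print(pi.target, pi.text)
```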

If you actually want to load the stylesheet, you have to be explicit about it. Something like this is needed:

from lxml import etree 

tree = etree.parse("http://www.greenbuttondata.org/data/15MinLP_15Days.xml") 

# Create an _XSLTProcessingInstruction object 
pi = tree.xpath("//processing-instruction()")[0] 

# Parse the stylesheet and return an ElementTree 
xsl = pi.parseXSL()

2012-05-06 10:14:39 mzjn

Downvoter: please explain what is wrong with this answer. – mzjn

Thank you, mzjn. You are right! Upvoted. – Duke

@Duke: Thanks! Glad to finally get some positive feedback. – mzjn

In concert with what mzjn said, if you do want to pass a string to etree.parse(), just wrap it in a StringIO object.

Example:

from lxml import etree 
from StringIO import StringIO 

myString = "<html><p>blah blah blah</p></html>" 

tree = etree.parse(StringIO(myString))

This method is used in the lxml documentation.
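On Python 3 the same idea might look like the sketch below. It assumes the data arrives as bytes, as it does from urllib.request.urlopen(...).read(), so it uses io.BytesIO instead of StringIO. The sample markup is a placeholder.

```python
from io import BytesIO
from lxml import etree

my_bytes = b"<html><p>blah blah blah</p></html>"

# Wrapping the bytes gives parse() the file-like object it expects.
tree = etree.parse(BytesIO(my_bytes))
root = tree.getroot()

print(root.tag, root[0].text)
```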

2012-10-20 01:13:14 kevin

For python3: 'from io import StringIO' – Adversus

The lxml docs for parse say that to parse from a string, you should use the fromstring() function instead.

parse(...) 
    parse(source, parser=None, base_url=None) 

    Return an ElementTree object loaded with source elements. If no parser 
    is provided as second argument, the default parser is used. 

    The ``source`` can be any of the following: 

    - a file name/path 
    - a file object 
    - a file-like object 
    - a URL using the HTTP or FTP protocol 

    To parse from a string, use the ``fromstring()`` function instead. 

    Note that it is generally faster to parse from a file path or URL 
    than from an open file object or file-like object. Transparent 
    decompression from gzip compressed sources is supported (unless 
    explicitly disabled in libxml2).
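A minimal sketch of that advice in Python 3, using a placeholder document. Note that fromstring() returns the root Element rather than an ElementTree; getroottree() recovers the tree if it is needed. Bytes are passed here because lxml rejects a text string that carries an encoding declaration.

```python
from lxml import etree

# Placeholder for content already held in memory, e.g. a downloaded body.
xml_bytes = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<feed xmlns="http://www.w3.org/2005/Atom"><title>demo</title></feed>'
)

root = etree.fromstring(xml_bytes)  # returns an Element, not an ElementTree
tree = root.getroottree()           # the enclosing ElementTree, if required

title = root.findtext("{http://www.w3.org/2005/Atom}title")
print(root.tag, title)
```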

2013-06-25 20:00:35 jrwren
