2015-07-03 30 views

My HTML file has the following line containing &nbsp; text:

<tr><td>&nbsp;</td><tr> 

But when I parse it with lxml:

from lxml import etree as ET 
tree = ET.parse("file.html") 

I get the following error:

Traceback (most recent call last): 
File "<stdin>", line 1, in <module> 
File "lxml.etree.pyx", line 3310, in lxml.etree.parse (src/lxml/lxml.etree.c:72517) 
File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:105979) 
File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106278) 
File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105277) 
File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100227) 
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94350) 
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:95786) 
File "parser.pxi", line 620, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:94853) 
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 14, column 159 

Some might consider this a duplicate of http://stackoverflow.com/questions/19974909/xml-non-breaking-space –

Answers


Use lxml.html, not lxml.etree, for HTML. The &nbsp; entity is not predefined in XML, but it is defined in HTML. Thus:

>>> import lxml.html
>>> lxml.html.fromstring('''<tr><td>&nbsp;</td><tr>''') 
<Element div at 0x10a7a5e68> 

...works fine.
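
As a slightly fuller illustration of the same idea, here is a minimal sketch that parses a complete table and reads the cell back. The sample markup and the variable names (html_text, table, cell) are made up for this example and are not from the question's file.html.

```python
import lxml.html

# Hypothetical sample markup; a complete table is used so the HTML parser
# keeps the row and cell as written.
html_text = "<table><tr><td>&nbsp;</td></tr></table>"

# The HTML parser knows the HTML named entities, so &nbsp; is accepted
# without any declaration.
table = lxml.html.fromstring(html_text)
cell = table.find(".//td")

# &nbsp; is decoded to U+00A0 (NO-BREAK SPACE).
print(repr(cell.text))
```

For a file on disk, lxml.html.parse("file.html") should be the counterpart of the ET.parse call in the question.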


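Alternatively, you can use the XML equivalent of &nbsp;, which is &#160;, in your document, or you can declare a DOCTYPE in your XML file with <!ENTITY nbsp "&#160;"> as part of its content.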
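
Both XML-side options can be sketched roughly like this, assuming the document is otherwise well-formed XML (note the closing </tr>, which the XML parser requires). The sample strings and variable names are illustrative only.

```python
from lxml import etree

# Option 1: use the numeric character reference, which needs no declaration.
doc1 = etree.fromstring("<tr><td>&#160;</td></tr>")

# Option 2: declare the nbsp entity in an internal DTD subset of the DOCTYPE.
xml_with_doctype = (
    '<!DOCTYPE tr [<!ENTITY nbsp "&#160;">]>'
    "<tr><td>&nbsp;</td></tr>"
)
doc2 = etree.fromstring(xml_with_doctype)

# Both cells should contain U+00A0 (NO-BREAK SPACE).
text1 = doc1.find("td").text
text2 = doc2.find("td").text
print(repr(text1), repr(text2))
```

The numeric reference is the less invasive change if you control the document's content; the DOCTYPE route keeps the &nbsp; spelling but requires every such file to carry the declaration.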


Thanks, this works great – user3293692