2015-06-25 36 views
5

lxml error: "ValueError: Input object has no element: HtmlComment"

I'm trying to iterate over the text content of a subtree using elt.itertext() (lxml v3.5.0b1), as follows:

import lxml.html.soupparser as soupparser 
import requests 

doc = requests.get("http://f10.5post.com/forums/showthread.php?t=1142017").content 
tree = soupparser.fromstring(doc) 

nodes = tree.getchildren() 

for elt in nodes: 
    for t in elt.itertext(): 
        print t 

But I keep getting an error saying:

File "src/lxml/iterparse.pxi", line 248, in lxml.etree.iterwalk.__init__ (src/lxml/lxml.etree.c:134032) 
File "src/lxml/apihelpers.pxi", line 67, in lxml.etree._rootNodeOrRaise (src/lxml/lxml.etree.c:15220) 
ValueError: Input object has no element: HtmlComment 

Is there a way to skip all HTML comments? Also, what does this error actually mean?

Thanks.

+0

Not sure there is any built-in way, unless you use a PullParser – AndyG

+0

@AndyG I'm not sure why lxml trips up on this particular case. I'd hope I don't need to skip HTML comments just to avoid this error, though. – Kar

+0

I haven't used this library, but I think you can easily do what you need with [BeautifulSoup](https://pypi.python.org/pypi/beautifulsoup4). – rll

Answers

0

This is normal behavior.

>>> from lxml import etree 
>>> doc = ''' 
... <html><!-- PAGENAV POPUP --> 
...  <div class="vbmenu_popup" id="pagenav_menu" style="display:none"> 
...    <table cellpadding="4" cellspacing="1" border="0"> 
...    <tr> 
...      <td class="thead" nowrap="nowrap">Go to Page...</td> 
...    </tr> 
...    <tr> 
...      <td class="vbmenu_option" title="nohilite"> 
...      <form action="index.php" method="get" onsubmit="return this.gotopage()" id="pagenav_form"> 
...        <input type="text" class="bginput" id="pagenav_itxt" style="font-size:11px" size="4" /> 
...        <input type="button" class="button" id="pagenav_ibtn" value="Go" /> 
...      </form> 
...      </td> 
...    </tr> 
...    </table> 
...  </div> 
... <!--/PAGENAV POPUP --> 
... </html>''' 
>>> root = etree.fromstring(doc) 
>>> nodes = root.getchildren() 
>>> nodes 
[<!-- PAGENAV POPUP -->, <Element div at 0x10367f290>, <!--/PAGENAV POPUP -->] 
>>> for elt in nodes: 
...  for t in elt.itertext(): 
...   print t 
... 
Traceback (most recent call last): 
    File "<stdin>", line 2, in <module> 
    File "lxml.etree.pyx", line 1406, in lxml.etree._Element.itertext (src/lxml/lxml.etree.c:48845) 
    File "lxml.etree.pyx", line 2763, in lxml.etree.ElementTextIterator.__cinit__ (src/lxml/lxml.etree.c:64747) 
    File "iterparse.pxi", line 219, in lxml.etree.iterwalk.__init__ (src/lxml/lxml.etree.c:125303) 
    File "apihelpers.pxi", line 72, in lxml.etree._rootNodeOrRaise (src/lxml/lxml.etree.c:13689) 
ValueError: Input object has no element: lxml.etree._Comment 

As you can see above:

>>> nodes 
[<!-- PAGENAV POPUP -->, <Element div at 0x10367f290>, <!--/PAGENAV POPUP -->] 

Note: getchildren() is deprecated. You can use list() instead.

>>> list(root) 
[<!-- PAGENAV POPUP -->, <Element div at 0x10367f290>, <!--/PAGENAV POPUP -->] 

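nodes is a list containing both elements and comments, and the traceback shows that itertext() fails when it is called on one of the comment nodes. If you check how itertext() works: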

Creates a text iterator. The iterator loops over this element and all subelements, in document order, and returns all inner text.
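To make the quoted description concrete, here is a minimal sketch of what itertext() yields for an ordinary element. The sample markup is made up for illustration and is not taken from the page in the question.

```python
from lxml import etree

# Hypothetical sample markup, not taken from the page in the question.
div = etree.fromstring("<div>Go to <b>Page</b> now</div>")

# itertext() yields div.text, then b.text, then b.tail, in document order.
parts = list(div.itertext())
print(parts)
```

The parenthesized print is used so the snippet should run on both Python 2 and Python 3.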

On the other hand, if instead of iterating over the list I iterate directly over the root element:

>>> for t in root.itertext(): 
...  print t 
... 

I get all the text and a lot of whitespace. :)
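If the whitespace-only chunks are unwanted, one possible approach is to strip each chunk and drop the empty ones. This is only a sketch: the markup and the name chunks are assumptions made for illustration.

```python
from lxml import etree

# Hypothetical sample markup with indentation between the tags.
doc = """<html>
  <div>
    <p>Go to Page...</p>
    <p>  Go  </p>
  </div>
</html>"""
root = etree.fromstring(doc)

# Strip each text chunk and keep only those that still have content.
chunks = [t.strip() for t in root.itertext() if t.strip()]
print(chunks)
```

Note that stripping also removes leading and trailing spaces inside inline text, which may or may not be what you want when the pieces are joined back together.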

If you still want to iterate over the list of nodes, you can use the tag to tell elements and comments apart:

>>> [item.tag for item in nodes] 
[<built-in function Comment>, 'div', <built-in function Comment>] 

By extension, you can also check the class:

>>> [item.__class__ for item in nodes] 
[<type 'lxml.etree._Comment'>, <type 'lxml.etree._Element'>, <type 'lxml.etree._Comment'>] 
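
Putting the tag check to work, a minimal sketch of the original loop that skips comments could look like the following. It uses a shortened, made-up version of the document above, and the name texts is introduced here for illustration.

```python
from lxml import etree

# Shortened, hypothetical version of the document used above.
doc = ("<html><!-- PAGENAV POPUP -->"
       "<div>Go to <b>Page</b>...</div>"
       "<!--/PAGENAV POPUP --></html>")
root = etree.fromstring(doc)

texts = []
for elt in root:
    # Comment nodes report the etree.Comment factory as their tag; skip them.
    if elt.tag is etree.Comment:
        continue
    texts.extend(elt.itertext())
print(texts)
```

The same check should also apply to the HtmlComment nodes produced by lxml.html, since they are comment nodes as well, but I have not verified that against the page in the question.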