Getting all HTML elements using lxml

I am trying to parse a large div tag in my HTML document, and I need to get all of the HTML and nested tags inside that div. My code:

innerTree = fromstring(str(response.text)) 
print("The tags inside the target div are") 
print(innerTree.cssselect('div.story-body__inner'))

But it prints:

[<Element div at 0x66daed0>]

I want it to return all the HTML tags inside the div. How can I do this with lxml?


2017-02-17 Mehdi

In my case BeautifulSoup does not work very well; it gives me incorrect HTML tags! I have to find – Mehdi

lxml is a great library. There is no need to use BeautifulSoup or anything else. Here is how to get the additional information you are looking for:

# import lxml HTML parser and HTML output function 
from __future__ import print_function 
from lxml.html import fromstring 
from lxml.etree import tostring as htmlstring 

# test HTML for demonstration 
raw_html = """ 
    <div class="story-body__inner"> 
     <p>Test para with <b>subtags</b></p> 
     <blockquote>quote here</blockquote> 
     <img src="..."> 
    </div> 
""" 

# parse the HTML into a tree structure 
innerTree = fromstring(raw_html) 

# find the divs you want 
# first by finding all divs with the given CSS selector 
divs = innerTree.cssselect('div.story-body__inner') 

# but cssselect returns a list, so grab the first of those
div0 = divs[0] 

# print that div, and its full HTML representation 
print(div0) 
print(htmlstring(div0)) 

# now to find sub-items 

print('\n-- etree nodes') 
for e in div0.xpath(".//*"): 
    print(e) 

print('\n-- HTML tags') 
for e in div0.xpath(".//*"): 
    print(e.tag) 

print('\n-- full HTML text') 
for e in div0.xpath(".//*"): 
    print(htmlstring(e))

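Note that lxml functions such as a node's cssselect and xpath return lists, not single nodes. You must index into those lists to get the contained node, even if there is only one.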

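"Getting all the child tags" or "the child HTML" can mean several things: getting the ElementTree nodes, getting the tag names, or getting the full HTML text of those nodes. The answer's script demonstrates all three, using an XPath query to do so. Sometimes CSS selectors are more convenient, sometimes XPath. In this case, the XPath query .//* means "return all nodes of any tag name, at any depth, below the current node".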

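Running the script under Python 2 produces the result below. The same code runs under Python 3, though the output text differs slightly because etree.tostring returns byte strings rather than Unicode strings there.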

<Element div at 0x106eac8e8> 
<div class="story-body__inner"> 
     <p>Test para with <b>subtags</b></p> 
     <blockquote>quote here</blockquote> 
     <img src="..."/> 
    </div> 


-- etree nodes 
<Element p at 0x106eac838> 
<Element b at 0x106eac890> 
<Element blockquote at 0x106eac940> 
<Element img at 0x106eac998> 

-- HTML tags 
p 
b 
blockquote 
img 

-- full HTML text 
<p>Test para with <b>subtags</b></p> 
<b>subtags</b> 
<blockquote>quote here</blockquote> 
<img src="..."/>
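If what you actually want is the markup inside the div as a single string, without the div's own tag, something like the following may help. This is only a sketch under my own assumptions: inner_html is a hypothetical helper name (it is not part of lxml), the test HTML is a shortened copy of the one above, and it uses lxml.html.tostring with encoding="unicode" so that the result is a text string rather than bytes under Python 3.

```python
from lxml.html import fromstring, tostring

# shortened test HTML, same shape as the demonstration above
raw_html = """
<div class="story-body__inner">
  <p>Test para with <b>subtags</b></p>
  <blockquote>quote here</blockquote>
</div>
"""


def inner_html(element):
    """Hypothetical helper: serialize what is inside `element`, not its own tag."""
    # text that appears before the first child element, if any
    parts = [element.text or ""]
    # each direct child is serialized together with its nested tags and tail text
    for child in element:
        parts.append(tostring(child, encoding="unicode"))
    return "".join(parts)


tree = fromstring(raw_html)
# xpath also returns a list, so index it to get the node
div0 = tree.xpath('//div[@class="story-body__inner"]')[0]
result = inner_html(div0)
print(result)
```

Only the direct children are iterated, because serializing a child already includes everything nested inside it; looping over .//* here would repeat the nested tags.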


2017-02-17 06:22:32

I hope the OP accepts your answer; I misunderstood his question. –

@Mehdi If this answers your question, please mark it as the accepted answer (by clicking the check mark at the top left of the answer). –
