lxml is a great library. There is no need to use BeautifulSoup or anything else. Here is how to get the additional information you are after:
# import lxml HTML parser and HTML output function
from __future__ import print_function
from lxml.html import fromstring
from lxml.etree import tostring as htmlstring
# test HTML for demonstration
raw_html = """
<div class="story-body__inner">
<p>Test para with <b>subtags</b></p>
<blockquote>quote here</blockquote>
<img src="...">
</div>
"""
# parse the HTML into a tree structure
innerTree = fromstring(raw_html)
# find the divs you want
# first by finding all divs with the given CSS selector
divs = innerTree.cssselect('div.story-body__inner')
# but that takes a list, so grab the first of those
div0 = divs[0]
# print that div, and its full HTML representation
print(div0)
print(htmlstring(div0))
# now to find sub-items
print('\n-- etree nodes')
for e in div0.xpath(".//*"):
    print(e)
print('\n-- HTML tags')
for e in div0.xpath(".//*"):
    print(e.tag)
print('\n-- full HTML text')
for e in div0.xpath(".//*"):
    print(htmlstring(e))
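Note that lxml functions such as a node's cssselect and xpath return lists, not single nodes. You must index into those lists to get the nodes they contain, even when there is only one.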
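A minimal sketch of that list behaviour, using xpath (cssselect behaves the same way but relies on the separate cssselect package being installed). The sample HTML and the variable names are illustrative assumptions, not part of the original question:

```python
from lxml.html import fromstring

tree = fromstring('<div class="story-body__inner"><p>only one</p></div>')

# xpath() returns a list even when exactly one node matches
matches = tree.xpath('.//p')
print(type(matches), len(matches))

# index into the list to get the element itself
first = matches[0]
print(first.tag, first.text)

# an unmatched query gives an empty list, so check before indexing
missing = tree.xpath('.//blockquote')
node = missing[0] if missing else None
print(node)
```

Checking for an empty list before indexing avoids an IndexError when the page does not contain the element you expected.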
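Getting "all the sub-tags" or "the sub-HTML" could mean several things: getting the ElementTree nodes, getting the tag names, or getting the full HTML text of those nodes. This code demonstrates all three, and it does so with an XPath query. Sometimes CSS selectors are more convenient, sometimes XPath is. In this case, the XPath query .//* means "return all nodes, at any depth below the current node, with any tag name."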
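To make the "any depth" part concrete, here is a small sketch comparing .//* with ./* (direct children only) and with a query restricted to one tag name. The HTML is a trimmed copy of the test snippet; the variable names are assumptions for illustration:

```python
from lxml.html import fromstring

div = fromstring(
    '<div class="story-body__inner">'
    '<p>Test para with <b>subtags</b></p>'
    '<blockquote>quote here</blockquote>'
    '</div>'
)

# ./* matches direct children only
children = [e.tag for e in div.xpath('./*')]
# .//* matches descendants at any depth, in document order
descendants = [e.tag for e in div.xpath('.//*')]
# .//b narrows the same any-depth search to a single tag name
bold = [e.text for e in div.xpath('.//b')]

print(children)     # ['p', 'blockquote']
print(descendants)  # ['p', 'b', 'blockquote']
print(bold)         # ['subtags']
```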
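Running this under Python 2 gives the output below. (The same code runs under Python 3, though the output text differs slightly, because etree.tostring returns byte strings rather than Unicode strings under Python 3.)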
<Element div at 0x106eac8e8>
<div class="story-body__inner">
<p>Test para with <b>subtags</b></p>
<blockquote>quote here</blockquote>
<img src="..."/>
</div>
-- etree nodes
<Element p at 0x106eac838>
<Element b at 0x106eac890>
<Element blockquote at 0x106eac940>
<Element img at 0x106eac998>
-- HTML tags
p
b
blockquote
img
-- full HTML text
<p>Test para with <b>subtags</b></p>
<b>subtags</b>
<blockquote>quote here</blockquote>
<img src="..."/>
In my case BeautifulSoup did not work very well and gave me incorrect HTML tags! I must find – Mehdi