Use .text_content(), which lxml.html elements provide.
Working sample:
from lxml.html import fromstring
data = """
<description>
the thing <b>stuff</b> is very important for various reasons, notably <b>other things</b>.
</description>
"""
tree = fromstring(data)
print(tree.xpath("//description")[0].text_content().strip())
Prints:
the thing stuff is very important for various reasons, notably other things.
I forgot to specify one thing though, sorry. My ideal parsed version would contain a list of sections: [normal("the thing"), bold("stuff"), normal("....")]. Is that possible with the lxml.html library?
Assuming you only have text nodes and b elements inside description:
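for item in tree.xpath("//description/*|//description/text()"):
    # text() results are str subclasses; anything else is a <b> element
    # (on Python 2, test against basestring instead of str)
    print([item.strip(), 'normal'] if isinstance(item, str) else [item.text, 'bold'])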
Prints:
['the thing', 'normal']
['stuff', 'bold']
['is very important for various reasons, notably', 'normal']
['other things', 'bold']
['.', 'normal']
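The same split can also be done without XPath by walking the element's .text and each child's .tail. The following is only a sketch under a few assumptions: Python 3, and no comments or processing instructions among the children of description. Segment and split_segments are hypothetical names introduced for illustration, not part of lxml. The fallback import of xml.etree.ElementTree is there only so the sketch runs without lxml installed; both libraries expose .text, .tail and itertext().

```python
from collections import namedtuple

try:
    from lxml.html import fromstring
except ImportError:  # convenience fallback: the stdlib has the same .text/.tail API
    from xml.etree.ElementTree import fromstring

# Hypothetical container for one run of text; not an lxml type.
Segment = namedtuple("Segment", ["style", "text"])


def split_segments(element):
    """Return the content of `element` as a list of normal/bold Segments."""
    segments = []

    def add(style, text):
        # Ignore missing or whitespace-only pieces.
        if text and text.strip():
            segments.append(Segment(style, text.strip()))

    add("normal", element.text)  # text before the first child
    for child in element:
        style = "bold" if child.tag == "b" else "normal"
        add(style, "".join(child.itertext()))  # all text inside the child
        add("normal", child.tail)  # text between this child and the next
    return segments


data = """
<description>
the thing <b>stuff</b> is very important for various reasons, notably <b>other things</b>.
</description>
"""

tree = fromstring(data)
# The parser may or may not hand back <description> itself as the root.
description = tree if tree.tag == "description" else tree.find(".//description")

result = split_segments(description)
for segment in result:
    print(segment)
```

For the sample above this should give the same five pieces as the XPath version, just wrapped in Segment tuples. Unlike item.text, joining itertext() also picks up text from tags nested inside a b element.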
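I forgot to specify one thing though, sorry. My ideal parsed version would contain a list of sections: [normal("the thing"), bold("stuff"), normal("....")]. Is that possible with the lxml.html library? –
@DanielLovasko sure, updated. – alecxe
Wow, pretty cool. Thanks! @alecxe –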