Sample Python shell session:
>>> import lxml.html
>>> doc = lxml.html.fromstring("""<div class="inner_body_left">
... <ul>
... <li>
... Lake 2014: 9th Biennial Lake Symposium on "
... <a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
... ", 13-15th November 2014
... </li>
... </ul>
... </div>""")
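The simplest approach is to use string() if you know your XPath expression matches exactly one node.
Otherwise, string() converts only the first node of the node-set (in document order) and ignores the rest: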
>>> doc.xpath("string(//div[@class='inner_body_left']/ul/li)")
'\nLake 2014: 9th Biennial Lake Symposium on "\nConservation of Wetland Ecosystems in Western Ghats\n", 13-15th November 2014\n'
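To make the "first node only" behaviour concrete, here is a minimal sketch on a hypothetical two-item list (the markup and the variable names are made up for illustration). It contrasts string() applied to a multi-node match with string(.) evaluated relative to each matched element:

```python
import lxml.html

# Hypothetical fragment with two <li> elements, used only for illustration.
root = lxml.html.fromstring(
    "<ul><li>first <b>item</b></li><li>second item</li></ul>"
)

# string() on a node-set converts only its first node in document order.
first_only = root.xpath("string(//li)")

# One string per match: evaluate string(.) relative to each element.
per_item = [li.xpath("string(.)") for li in root.xpath("//li")]

print(repr(first_only))  # 'first item'
print(per_item)          # ['first item', 'second item']
```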
To get all text nodes:
>>> doc.xpath("//div[@class='inner_body_left']/ul/li//text()")
['\nLake 2014: 9th Biennial Lake Symposium on "\n', 'Conservation of Wetland Ecosystems in Western Ghats', '\n", 13-15th November 2014\n']
>>> doc.xpath("//div[@class='inner_body_left']/ul/li/descendant-or-self::*/text()")
['\nLake 2014: 9th Biennial Lake Symposium on "\n', 'Conservation of Wetland Ecosystems in Western Ghats', '\n", 13-15th November 2014\n']
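If it helps to see where each piece of text comes from, the strings that lxml returns for text() carry their origin: getparent(), is_text and is_tail. The sketch below uses a hypothetical fragment. Note that the text after the closing </a> is a child text node of the li in XPath terms, even though lxml stores it as the tail of the a element:

```python
import lxml.html

# Hypothetical fragment, chosen only to keep the output short.
root = lxml.html.fromstring(
    '<ul><li>before <a href="#">link</a> after</li></ul>'
)

# For each text node: its value, the tag of the element lxml attaches it to,
# and whether lxml stores it as that element's .text or .tail.
origins = [
    (str(t), t.getparent().tag, t.is_text, t.is_tail)
    for t in root.xpath("//li//text()")
]

for origin in origins:
    print(origin)
```

Here ' after' is reported as the tail of the a element, yet an XPath step such as li/text() still selects it, because XPath treats it as a child of li.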
To exclude the text inside a elements, use /descendant-or-self::*[not(self::a)]/ instead of //:
>>> doc.xpath("//div[@class='inner_body_left']/ul/li/descendant-or-self::*[not(self::a)]/text()")
['\nLake 2014: 9th Biennial Lake Symposium on "\n', '\n", 13-15th November 2014\n']
>>> "".join(doc.xpath("//div[@class='inner_body_left']/ul/li/descendant-or-self::*[not(self::a)]/text()"))
'\nLake 2014: 9th Biennial Lake Symposium on "\n\n", 13-15th November 2014\n'
>>>
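The results above still contain the newlines from the source markup. If you want them cleaned up, one possible sketch (not the only way) is to use XPath's normalize-space() for the full text, and a plain Python split/join for the variant that skips the a elements. The names li_path, full_text, pieces and without_link are only illustrative:

```python
import lxml.html

doc = lxml.html.fromstring("""<div class="inner_body_left">
<ul>
<li>
Lake 2014: 9th Biennial Lake Symposium on "
<a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
", 13-15th November 2014
</li>
</ul>
</div>""")

li_path = "//div[@class='inner_body_left']/ul/li"

# normalize-space() trims both ends and collapses whitespace runs to one space.
# Like string(), it only looks at the first node of a node-set.
full_text = doc.xpath("normalize-space(%s)" % li_path)

# Same cleanup done in Python for the text outside the <a> element.
pieces = doc.xpath(li_path + "/descendant-or-self::*[not(self::a)]/text()")
without_link = " ".join("".join(pieces).split())

print(full_text)
print(without_link)
```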
Updated with an example where multiple elements are selected:
>>> doc = """<div class="inner_body_left">
... <ul>
... <li>
... Lake 2014: 9th Biennial Lake Symposium on "
... <a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
... ", 13-15th November 2014
... </li>
... <li>
... Lake 2015: 10th Biennial Lake Symposium on "
... <a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
... ", 13-15th November 2015
... </li>
... </ul>
... </div>"""
>>> root = lxml.html.fromstring(doc)
>>>
>>> import pprint
>>> pprint.pprint([element.xpath("string(.)")
... for element in root.xpath("//div[@class='inner_body_left']/ul/li")])
['\nLake 2014: 9th Biennial Lake Symposium on "\nConservation of Wetland Ecosystems in Western Ghats\n", 13-15th November 2014\n',
'\nLake 2015: 10th Biennial Lake Symposium on "\nConservation of Wetland Ecosystems in Western Ghats\n", 13-15th November 2015\n']
>>> pprint.pprint(["".join(element.xpath("./descendant-or-self::*[not(self::a)]/text()"))
... for element in root.xpath("//div[@class='inner_body_left']/ul/li")]
...)
['\nLake 2014: 9th Biennial Lake Symposium on "\n\n", 13-15th November 2014\n',
'\nLake 2015: 10th Biennial Lake Symposium on "\n\n", 13-15th November 2015\n']
>>>
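The two list comprehensions above can be folded into one small helper. This is only a sketch: item_texts and skip_tag are hypothetical names, the sample markup is shortened, and the whitespace cleanup is an extra step that the session above does not perform:

```python
import lxml.html


def item_texts(root, item_path, skip_tag=None):
    """Return one whitespace-normalized string per element matched by item_path.

    Hypothetical helper: when skip_tag is given, text nodes that are direct
    children of elements with that tag are left out.
    """
    if skip_tag is None:
        text_path = "./descendant-or-self::*/text()"
    else:
        text_path = "./descendant-or-self::*[not(self::%s)]/text()" % skip_tag
    results = []
    for element in root.xpath(item_path):
        joined = "".join(element.xpath(text_path))
        # Collapse newlines and repeated spaces into single spaces.
        results.append(" ".join(joined.split()))
    return results


# Shortened, made-up version of the list from the example above.
root = lxml.html.fromstring("""<div class="inner_body_left">
<ul>
<li>Lake 2014: <a href="#">Wetland Ecosystems</a>, November 2014</li>
<li>Lake 2015: <a href="#">Wetland Ecosystems</a>, November 2015</li>
</ul>
</div>""")

item_path = "//div[@class='inner_body_left']/ul/li"
all_text = item_texts(root, item_path)
no_links = item_texts(root, item_path, skip_tag="a")

print(all_text)
print(no_links)
```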
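I just downvoted your question because you give very little information in each new comment. Please edit it into a more useful question and include all the information needed to answer it correctly! If you ask a question and expect a useful answer, you should also provide all the relevant information; otherwise I seriously consider this a waste of my time. – dirkk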