2014-04-30 34 views
-1

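XPath to get all content under a tag as a single string

I want to write an XPath in Python that gets the entire content of an li tag, including the content of the a tag.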

<li> 
Lake 2014: 9th Biennial Lake Symposium on " 
<a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a> 
", 13-15th November 2014 
</li> 

The XPath I wrote is:

//div[@class='inner_body_left']/ul/li//text()

This outputs 3 separate strings:

Lake 2014: 9th Biennial Lake Symposium on " 
Conservation of Wetland Ecosystems in Western Ghats 
", 13-15th November 2014. 

How can I get them as a single string?

+0

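I just downvoted your question, because you give only a little information in each new comment. Please edit it into a more valuable question and provide all the information required to answer it correctly! If you ask a question and expect a useful answer, you should also provide all the information; otherwise I seriously consider this a waste of my time. – dirkk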

Answers

1

Sample Python shell session:

>>> import lxml.html 
>>> doc = lxml.html.fromstring("""<div class="inner_body_left"> 
... <ul> 
... <li> 
... Lake 2014: 9th Biennial Lake Symposium on " 
... <a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a> 
... ", 13-15th November 2014 
... </li> 
... </ul> 
... </div>""") 

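The simplest way is to use string(), provided you know your XPath expression matches only 1 node; otherwise string() converts only the first node of the matched node-set (in document order) and ignores the rest: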

>>> doc.xpath("string(//div[@class='inner_body_left']/ul/li)") 
'\nLake 2014: 9th Biennial Lake Symposium on "\nConservation of Wetland Ecosystems in Western Ghats\n", 13-15th November 2014\n' 

Getting all text nodes:

>>> doc.xpath("//div[@class='inner_body_left']/ul/li//text()") 
['\nLake 2014: 9th Biennial Lake Symposium on "\n', 'Conservation of Wetland Ecosystems in Western Ghats', '\n", 13-15th November 2014\n'] 
>>> doc.xpath("//div[@class='inner_body_left']/ul/li/descendant-or-self::*/text()") 
['\nLake 2014: 9th Biennial Lake Symposium on "\n', 'Conservation of Wetland Ecosystems in Western Ghats', '\n", 13-15th November 2014\n'] 

Excluding the text of a elements (using /descendant-or-self::*[not(self::a)]/ instead of //):

>>> doc.xpath("//div[@class='inner_body_left']/ul/li/descendant-or-self::*[not(self::a)]/text()") 
['\nLake 2014: 9th Biennial Lake Symposium on "\n', '\n", 13-15th November 2014\n'] 
>>> "".join(doc.xpath("//div[@class='inner_body_left']/ul/li/descendant-or-self::*[not(self::a)]/text()")) 
'\nLake 2014: 9th Biennial Lake Symposium on "\n\n", 13-15th November 2014\n' 
>>> 
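
The string() result above still contains the newlines from the markup. As a minimal sketch, assuming a single line with collapsed whitespace is what is wanted, XPath 1.0's normalize-space() can be used in place of string(); the variable names html and one_line are only illustrative:

```python
import lxml.html

html = """<div class="inner_body_left">
<ul>
<li>
Lake 2014: 9th Biennial Lake Symposium on "
<a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
", 13-15th November 2014
</li>
</ul>
</div>"""

doc = lxml.html.fromstring(html)

# normalize-space() takes the string value of the first matched node,
# strips leading/trailing whitespace and collapses inner runs to one space.
one_line = doc.xpath("normalize-space(//div[@class='inner_body_left']/ul/li)")
print(one_line)
```

Like string(), normalize-space() only looks at the first node of a node-set, so with several li elements it would have to be applied per element, e.g. element.xpath("normalize-space(.)").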

Updated with an example where multiple elements are selected:

>>> doc = """<div class="inner_body_left"> 
... <ul> 
... <li> 
... Lake 2014: 9th Biennial Lake Symposium on " 
... <a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a> 
... ", 13-15th November 2014 
... </li> 
... <li> 
... Lake 2015: 10th Biennial Lake Symposium on " 
... <a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a> 
... ", 13-15th November 2015 
... </li> 
... </ul> 
... </div>""" 
>>> root = lxml.html.fromstring(doc) 
>>> 
>>> import pprint 
>>> pprint.pprint([element.xpath("string(.)") 
...    for element in root.xpath("//div[@class='inner_body_left']/ul/li")]) 
['\nLake 2014: 9th Biennial Lake Symposium on "\nConservation of Wetland Ecosystems in Western Ghats\n", 13-15th November 2014\n', 
'\nLake 2015: 10th Biennial Lake Symposium on "\nConservation of Wetland Ecosystems in Western Ghats\n", 13-15th November 2015\n'] 
>>> pprint.pprint(["".join(element.xpath("./descendant-or-self::*[not(self::a)]/text()")) 
...    for element in root.xpath("//div[@class='inner_body_left']/ul/li")] 
...) 
['\nLake 2014: 9th Biennial Lake Symposium on "\n\n", 13-15th November 2014\n', 
'\nLake 2015: 10th Biennial Lake Symposium on "\n\n", 13-15th November 2015\n'] 
>>> 
+0

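Actually, I need many nodes under the div tag. All the nodes under the div tag have a similar structure, so in this case string() does not work. The other options give 3 different strings. I want it as one string. – user3446000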

+0

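Have a look at the last example using '"".join()'. Also, you can always use 'element.xpath("string(.)") for element in doc.xpath("//div[@class='inner_body_left']/ul/li")' –

+0

@user3446000, I updated my answer with an example of multiple matches under the 'div' element –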

1

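The best option seems to be to simply use string() for this purpose. It also leaves out any comments in your XML, because the string value of an element is built only from its descendant text nodes. It converts the whole element to xs:string: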

//div[@class='inner_body_left']/ul/li/string() 

If that does not fit for some business-logic reason, you can always concatenate the strings:

concat(//div[@class='inner_body_left']/ul/li//text()) 
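
A hedged note for lxml users: lxml evaluates XPath 1.0 (through libxml2), where a trailing /string() step is not valid syntax (it is an XPath 2.0 form) and concat() requires at least two arguments. The sketch below, using a shortened made-up li element and illustrative variable names, records the errors raised by both expressions and shows XPath 1.0 forms that should give the joined text instead:

```python
import lxml.html
from lxml import etree

# Shortened, made-up sample markup for illustration only.
root = lxml.html.fromstring(
    '<div class="inner_body_left"><ul>'
    '<li>Lake 2014 on "<a href="#">Wetlands</a>", November 2014</li>'
    '</ul></div>'
)

path = "//div[@class='inner_body_left']/ul/li"

# Try the two expressions as written and record which ones raise.
failures = []
for expr in (path + "/string()", "concat(" + path + "//text())"):
    try:
        root.xpath(expr)
    except etree.XPathError as exc:
        failures.append((expr, type(exc).__name__))

# XPath 1.0 alternatives: wrap the path in string(), or join in Python.
joined = root.xpath("string(" + path + ")")
also_joined = "".join(root.xpath(path + "//text()"))
print(failures)
print(joined)
```

With an XPath 2.0 processor the /string() step is expected to work as written; the errors here are specific to XPath 1.0 engines.
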
+0

Bonus question: how would you ignore the text of 'a' but still concatenate all the other text nodes? – CoDEmanX

+0

Neither of these options works! – user3446000

+0

To avoid the content of 'a', you can write the XPath as //div[@class='inner_body_left']/ul/li/text() – user3446000

0

my solution

I additionally use substring() inside concat():

concat(substring(//div/ul/li/text()[1],1,string-length(//div/ul/li/text()[1])-1),//div/ul/li/a/text(),substring(//div/ul/li/text()[2],2)) 

<?xml version="1.0" encoding="UTF-8"?><div> 
    <ul> 
<li> 
Lake 2014: 9th Biennial Lake Symposium on " 
<a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a> 
", 13-15th November 2014 
</li> 
    </ul> 
</div> 

To get everything on one line, the line breaks around the link are removed first, using the substring() function.
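
For comparison, a similar one-line result can be sketched in plain Python instead of XPath string functions. This is a swapped-in technique, not the expression above: it uses the standard library's xml.etree.ElementTree and itertext(), strips the whitespace around each text piece, and joins the pieces. The names xml_source, pieces and one_line are illustrative:

```python
import xml.etree.ElementTree as ET

xml_source = """<div>
    <ul>
<li>
Lake 2014: 9th Biennial Lake Symposium on "
<a target="_blank" href="/events/CES_TVR_LAKE_2014_brochure_2FEb2014.pdf">Conservation of Wetland Ecosystems in Western Ghats</a>
", 13-15th November 2014
</li>
    </ul>
</div>"""

root = ET.fromstring(xml_source)
li = root.find("./ul/li")

# itertext() yields the li text, the a text and the text after </a>, in order.
pieces = [piece.strip() for piece in li.itertext()]

# Joining the stripped pieces removes the line breaks around the link.
one_line = "".join(pieces)
print(one_line)
```

Stripping every piece assumes no meaningful space is needed at the boundaries between text nodes, which holds here because the link title sits directly inside the quotation marks.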
