Why does this element in lxml include the tail?

Consider this Python script:

from lxml import etree

html = '''
<html xmlns="http://www.w3.org/1999/xhtml">
<head></head>
    <body>
    <p>This is some text followed with 2 citations.<span class="footnote">1</span>
     <span class="footnote">2</span>This is some more text.</p>
    </body>
</html>'''

tree = etree.fromstring(html)

# Print every span whose class is "footnote", in any namespace
for element in tree.findall(".//{*}span"):
    if element.get("class") == 'footnote':
        print(etree.tostring(element, encoding="unicode", pretty_print=True))

The desired output is just the 2 span elements, but instead I get:

<span xmlns="http://www.w3.org/1999/xhtml" class="footnote">1</span> 
<span xmlns="http://www.w3.org/1999/xhtml" class="footnote">2</span>This is some more text.

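Why does it include the text after the element, up to the end of the parent element?

I am trying to use lxml to link footnotes. When I a.insert() the span element into the a element I created for it, the text that follows comes along too, so large amounts of text that I don't want linked end up inside the link.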

2013-11-22 jorbas

Specifying with_tail=False omits the tail text from the output.

print(etree.tostring(element, encoding="unicode", pretty_print=True, with_tail=False))

See the lxml.etree.tostring documentation.

2013-11-22 13:35:37 falsetru

It includes the text after the element because that text belongs to that element.
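
A small sketch of how lxml stores the text may help here. It assumes a simplified, un-namespaced paragraph rather than the exact markup from the question, and simply inspects the text and tail attributes.

```python
from lxml import etree

# Simplified stand-in for the paragraph in the question (no namespace).
p = etree.fromstring("<p>First text.<span>1</span>Second text.</p>")
span = p[0]

# Text before the first child is stored on the parent as .text.
print(repr(p.text))     # 'First text.'
# Text inside the span is the span's own .text.
print(repr(span.text))  # '1'
# Text between the span's end tag and the next tag is the span's .tail.
print(repr(span.tail))  # 'Second text.'
# Nothing follows the root element, so it has no tail.
print(repr(p.tail))     # None
```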

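If you don't want that text to belong to the preceding span, you need to wrap it in an element of its own. However, when converting the element back to XML, you can avoid printing this text by passing with_tail=False to etree.tostring().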

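If you want to remove it from a particular element, you can also simply set that element's tail to ''. Note that this deletes the text from the tree itself, not just from the serialized output.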
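
For the footnote-linking case in the question, you probably want to keep the trailing text in the paragraph rather than delete it. One possible approach, sketched below, is to hand the tail over to the new wrapper element before moving the span inside it. The un-namespaced markup and the "#fn1" target are assumptions made up for illustration.

```python
from lxml import etree

# Simplified, un-namespaced stand-in for the paragraph in the question.
p = etree.fromstring(
    '<p>Some text.<span class="footnote">1</span>More text.</p>'
)
span = p[0]

# Hypothetical link target; "#fn1" is only an illustration.
a = etree.Element("a", href="#fn1")

# Hand the trailing text over to the wrapper so it stays where it was,
# then clear it on the span so it is not carried into the link.
a.tail = span.tail
span.tail = None

# Put the link where the span was, then move the span inside it.
p.replace(span, a)
a.append(span)

result = etree.tostring(p, encoding="unicode")
print(result)
# <p>Some text.<a href="#fn1"><span class="footnote">1</span></a>More text.</p>
```

The text order in the paragraph is unchanged; only the span ends up inside the link.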

2013-11-22 13:37:41

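I would have thought the text belonged to the 'p' element that contains the 'span'? The text is completely outside the 'span' element. – jorbas

@BigJord Yes, it is outside, which is why it is called the **tail**. It can't belong to the p element, because it would conflict with the first text, "This is some text followed with 2 citations.", which is already stored as the p element's text. –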
