lxml: ignore any tags nested inside a specific tag

I want to extract some specific fields from a huge XML file. Here is an example:

<?xml version="1.0" encoding="ISO-8859-1"?> 
<!DOCTYPE dblp SYSTEM "dblp.dtd"> 
    <dblp> 

<article mdate="2009-09-24" key="journals/jasis/GianoliM09"> 
<author>Ernesto Gianoli</author> 
<author>Marco A. Molina-Montenegro</author> 
<title>Insights into the relationship between the <i>h</i>-index and self-citations.</title> 
<pages>1283-1285</pages> 
<year>2009</year> 
<volume>60</volume> 
<journal>JASIST</journal> 
<number>6</number> 
<ee>http://dx.doi.org/10.1002/asi.21042</ee> 
<url>db/journals/jasis/jasis60.html#GianoliM09</url> 
</article> 


<article mdate="2014-09-18" key="journals/iacr/ShiCSL11" publtype="informal publication"> 
<author>Elaine Shi</author> 
<author>T.-H. Hubert Chan</author> 
<author>Emil Stefanov</author> 
<author>Mingfei Li</author> 
<title>blivious RAM with O((log N)<sup>3</sup>) Worst-Case Cost.</title> 
<pages>407</pages> 
<year>2011</year> 
<volume>2011</volume> 
<journal>IACR Cryptology ePrint Archive</journal> 
<ee>http://eprint.iacr.org/2011/407</ee> 
<url>db/journals/iacr/iacr2011.html#ShiCSL11</url> 
</article> 

<phdthesis mdate="2016-05-04" key="phd/it/Popescu2008"> 
<author>Razvan Andrei Popescu</author> 
<title>Aggregation and adaptation of web services: a semi-automated methodology for the aggregation and adaption of web services.</title> 
<year>2008</year> 
<school>University of Pisa</school> 
<pages>1-206</pages> 
<isbn>978-3-8364-6280-8</isbn> 
<ee>http://d-nb.info/991165179</ee> 
</phdthesis><phdthesis mdate="2007-04-26" key="phd/Tsangaris92"> 
<author>Manolis M. Tsangaris</author> 
<title>Principles of Static Clustering for Object Oriented Databases</title> 
<year>1992</year> 
<school>Univ. of Wisconsin-Madison</school> 
</phdthesis> 

<phdthesis mdate="2005-11-30" key="phd/Heuer2002"> 
<author>Andreas Heuer 0002</author> 
<title>Web-Pr&auml;senz-Management im Unternehmen</title> 
<year>2002</year> 
<school>Univ. Trier, FB 4, Informatik</school> 
<ee>http://ubt.opus.hbz-nrw.de/volltexte/2004/144/</ee> 
</phdthesis> 

<mastersthesis mdate="2002-01-03" key="phd/Schulte92"> 
<author>Christian Schulte</author> 
<title>Entwurf und Implementierung eines &uuml;bersetzenden Systems f&uuml;r das intuitionistische logische Programmieren auf der Warren Abstract Machine.</title> 
<year>1991</year> 
<school>Universit&auml;t Karlsruhe, Institut f&uuml;r Logik, Komplexit&auml;t und Deduktionssysteme</school> 
</mastersthesis> 

<phdthesis mdate="2002-01-03" key="phd/Hellerstein95"> 
<author>Joseph M. Hellerstein</author> 
<title>Optimization and Execution Techniques for Queries With Expensive Methods</title> 
<year>1995</year> 
<school>Univ. of Wisconsin-Madison</school> 
</phdthesis> 

</dblp>

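I am using the code here to parse the file and extract the fields I am interested in. The problem appears when I try to extract the title in the first and second records, because of the <i> and <sup> tags. My code seems to treat them as new events rather than as part of the <title> tag, and I get the following result: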

title Insights into the relationship between the 
blivious RAM with O((log N)

Basically, I get the title text only up to the point where the parser meets a new tag.

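The problem is that I don't know how many such cases there are (for example, which different tags can occur); otherwise I could try to remove them manually. Is there any way to handle this situation?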

Source

2016-07-19 Moj

You need to understand lxml's data model for element content (in particular the tail attribute). It is explained well here: http://infohost.nmt.edu/tcc/help/pubs/pylxml/web/etree-view.html.

The text attribute of this element,

<title>Insights into the relationship between the <i>h</i>-index and self-citations.</title>

is "Insights into the relationship between the ".

The "h" bit is the text of the child element <i>, and "-index and self-citations." is the tail of that same child.
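The text/tail split described above can be checked directly. The following is a minimal sketch that parses only the single <title> element from the question; the variable names title and child are chosen here for illustration.

```python
from lxml import etree

# Parse just the <title> element from the question.
title = etree.fromstring(
    "<title>Insights into the relationship between the "
    "<i>h</i>-index and self-citations.</title>"
)
child = title[0]  # the <i> element, the only child of <title>

print(repr(title.text))  # text before the first child
print(repr(child.text))  # text inside <i>
print(repr(child.tail))  # text after </i>, up to the end of <title>
```

Reading only title.text therefore stops at the first nested tag, which matches the truncated output shown in the question.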

To get all the text content of a title, you can use itertext(). For example:

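from lxml import etree

# Parse the XML from the question (saved as dblp.xml)
tree = etree.parse("dblp.xml")
titles = tree.xpath("//title")

for title in titles:
    # itertext() yields the element's own text plus the text and tail of every descendant
    print(''.join(title.itertext()))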

Output:

Insights into the relationship between the h-index and self-citations. 
blivious RAM with O((log N)3) Worst-Case Cost. 
Aggregation and adaptation of web services: a semi-automated methodology for the aggregation and adaption of web services. 
Principles of Static Clustering for Object Oriented Databases 
Web-Präsenz-Management im Unternehmen 
Entwurf und Implementierung eines übersetzenden Systems für das intuitionistische logische Programmieren auf der Warren Abstract Machine. 
Optimization and Execution Techniques for Queries With Expensive Methods
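Because the question mentions a huge file and parser events, the same idea can also be applied while streaming. The sketch below assumes the original code uses etree.iterparse (that code is not shown in the question); the helper name iter_titles and the inline sample are hypothetical. It waits for the end event of each top-level record, so the <title> and its nested children are complete before itertext() is called.

```python
from io import BytesIO
from lxml import etree

def iter_titles(source):
    """Yield the full text of each record's <title> while streaming the file."""
    for _event, elem in etree.iterparse(source, events=("end",)):
        parent = elem.getparent()
        # Act only on whole records, i.e. direct children of the root element.
        if parent is None or parent.getparent() is not None:
            continue
        title = elem.find("title")
        if title is not None:
            yield "".join(title.itertext())
        # Free memory: empty the record and drop records already processed.
        elem.clear()
        while elem.getprevious() is not None:
            del parent[0]

# Small in-memory stand-in for dblp.xml (hypothetical, shortened data).
sample = BytesIO(
    b"<dblp>"
    b"<article><author>A</author><title>the <i>h</i>-index.</title></article>"
    b"<article><title>O((log N)<sup>3</sup>) cost.</title></article>"
    b"</dblp>"
)
print(list(iter_titles(sample)))
```

For the real file, entities such as &auml; are defined in dblp.dtd, so the DTD most likely has to be available next to the XML and loaded, for example by passing load_dtd=True to iterparse.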

Source

2016-07-19 19:42:59 mzjn
