Python: Extracting text from a tag within an XML tree

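I'm currently parsing a Wikipedia dump, trying to extract some useful information. The dump is in XML, and I only want to extract the text/content of each page. Now I'm wondering how to find all the text inside a tag that is itself nested inside another tag. I searched for similar questions, but only found ones that deal with a single tag. Here is an example of what I want to achieve: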

<revision>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <contributor>
        <username>Foobar</username>
        <id>65536</id>
    </contributor>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of [[text]] here.</text>
    <minor />
</revision>

<example_tag>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <contributor>
        <username>Foobar</username>
        <id>65536</id>
    </contributor>
    <comment>I have just one thing to say!</comment>
    <text>A bunch of [[text]] here.</text>
    <minor />
</example_tag>

How can I extract the text inside the text tag, but only when it is contained in a revision tree?

Source

2017-03-17 J. Williams

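You can use the xml.etree.ElementTree module and query the tree with an XPath expression (ElementTree supports a limited subset of XPath):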

import xml.etree.ElementTree as ET 

root = ET.fromstring(the_xml_string) 
for content in root.findall('.//revision/othertag'): 
    # ... process content, for instance 
    print(content.text)

(where the_xml_string is a string containing the XML code).
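As a concrete illustration, here is a minimal sketch of that loop run against a shortened version of the sample from the question. The `<root>` wrapper is an assumption added for this sketch so that the string forms a single well-formed document, and `othertag` has been replaced with `text`.

```python
import xml.etree.ElementTree as ET

# Shortened sample from the question, wrapped in a hypothetical <root>
# element (an assumption: fromstring() needs exactly one top-level element).
the_xml_string = """<root>
  <revision>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <text>A bunch of [[text]] here.</text>
  </revision>
  <example_tag>
    <timestamp>2001-01-15T13:15:00Z</timestamp>
    <text>Text outside any revision.</text>
  </example_tag>
</root>"""

root = ET.fromstring(the_xml_string)

# Match only <text> elements that are direct children of a <revision>.
revision_texts = []
for content in root.findall('.//revision/text'):
    revision_texts.append(content.text)

print(revision_texts)
```

The `<text>` inside `<example_tag>` is skipped because the path requires a `revision` parent.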

Alternatively, collect the text of the matching elements into a list with a list comprehension:

import xml.etree.ElementTree as ET 

texts = [content.text for content in ET.fromstring(the_xml_string).findall('.//revision/othertag')]

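The .text attribute holds the element's inner text. Note that you have to replace othertag with the actual tag name (for example, text). If that tag can be nested arbitrarily deep inside the revision tag, use .//revision//othertag as the XPath query instead.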
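The difference between the two queries can be seen in a small sketch. The document below is made up for illustration: the `<root>` and `<wrapper>` elements are assumptions and do not come from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one <text> is a direct child of <revision>,
# another sits one level deeper inside a made-up <wrapper> element.
nested_xml = """<root>
  <revision>
    <text>direct child</text>
    <wrapper>
      <text>nested deeper</text>
    </wrapper>
  </revision>
  <example_tag>
    <text>not in a revision</text>
  </example_tag>
</root>"""

root = ET.fromstring(nested_xml)

# A single slash matches direct children of <revision> only.
direct = [el.text for el in root.findall('.//revision/text')]

# A double slash matches <text> at any depth below <revision>.
any_depth = [el.text for el in root.findall('.//revision//text')]

print(direct)
print(any_depth)
```

In both cases the `<text>` under `<example_tag>` is excluded, since neither path can reach it through a `revision` element.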

Source

2017-03-17 10:48:49
