Retrieve all article titles from an XML Wikipedia dump - Python

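I have created an XML Wikipedia dump by exporting all pages of a certain category. You can see the exact structure of this XML file by generating one yourself at https://en.wikipedia.org/wiki/Special:Export. Now I would like to make a list, in Python, of the title of each article. I have tried using: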

import xml.etree.ElementTree as ET 

tree = ET.parse('./comp_sci_wiki.xml') 
root = tree.getroot() 

for element in root: 
    for sub in element: 
        print(sub.find("title"))

Nothing is printed. This seems like it should be a relatively simple task. Any help you can provide would be much appreciated. Thanks!

Source

2016-04-05 user2585945

If you look at the beginning of the exported file, you will see that the document defines a default XML namespace:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLo

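This means that there is no un-namespaced "title" element in the document, which is one reason why your sub.find("title") statement fails. You can see this if you print out your root element: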

>>> print root 
<Element '{http://www.mediawiki.org/xml/export-0.10/}mediawiki' at 0x7f2a45df6c10>

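Note that it does not say <Element 'mediawiki'>. The identifier includes the full namespace. The documentation on namespace support describes in detail how to work with namespaces in XML documents, but the tl;dr version is that you need: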

>>> from xml.etree import ElementTree as ET 
>>> tree=ET.parse('/home/lars/Downloads/Wikipedia-20160405005142.xml') 
>>> root = tree.getroot() 
>>> ns = 'http://www.mediawiki.org/xml/export-0.10/'
>>> for page in root.findall('{%s}page' % ns):
...     print(page.find('{%s}title' % ns).text)
... 
Category:Wikipedia books on computer science 
Computer science in sport 
Outline of computer science 
Category:Unsolved problems in computer science 
Category:Philosophy of computer science 
[...etc...] 
>>>
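ElementTree's find() and findall() also accept a prefix-to-URI mapping as a second argument, which avoids the string formatting. A minimal sketch, assuming a tiny inline sample in place of the real export file (the prefix x is arbitrary, and the two page titles are borrowed from the output above):

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a real Special:Export file (illustrative sample only).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <siteinfo><sitename>Wikipedia</sitename></siteinfo>
  <page><title>Computer science in sport</title></page>
  <page><title>Outline of computer science</title></page>
</mediawiki>"""

# The prefix 'x' is an arbitrary local name; only the URI has to match
# the xmlns declared in the document.
NS = {'x': 'http://www.mediawiki.org/xml/export-0.10/'}

root = ET.fromstring(SAMPLE)

# 'x:page' is expanded to '{uri}page' using the mapping.
titles = [page.find('x:title', NS).text for page in root.findall('x:page', NS)]
print(titles)
```

With a real dump you would replace ET.fromstring(SAMPLE) with ET.parse(path).getroot().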

Your life will probably be easier if you install the lxml module, which includes full XPath support, letting you do something like this:

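>>> from lxml import etree
>>> tree = etree.parse('/home/lars/Downloads/Wikipedia-20160405005142.xml')
>>> nsmap = {'x': 'http://www.mediawiki.org/xml/export-0.10/'}
>>> for title in tree.xpath('//x:title', namespaces=nsmap):
...     print(title.text)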
... 
Category:Wikipedia books on computer science 
Computer science in sport 
Outline of computer science 
Category:Unsolved problems in computer science 
Category:Philosophy of computer science 
Category:Computer science organizations 
[...etc...]
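If installing lxml is not an option, the standard library can do a similar any-depth search with iter(). A rough sketch, again assuming a small inline sample rather than the real file; reading the namespace off root.tag is just one way to avoid hard-coding the export version:

```python
import xml.etree.ElementTree as ET

# Illustrative sample only; a real export contains many more elements.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <siteinfo><sitename>Wikipedia</sitename></siteinfo>
  <page><title>Outline of computer science</title></page>
  <page><title>Category:Philosophy of computer science</title></page>
</mediawiki>"""

root = ET.fromstring(SAMPLE)

# root.tag looks like '{uri}mediawiki'; keep the '{uri}' part so the
# schema version (0.10 here) does not have to be typed by hand.
ns_prefix = root.tag[:root.tag.index('}') + 1]

# iter() walks the whole tree, so it matches title elements at any depth.
titles = [el.text for el in root.iter(ns_prefix + 'title')]
print(titles)
```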

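In any case, read through the documentation on namespace support; hopefully that, plus these examples, will point you in the right direction. The takeaway should be that XML namespaces matter, and that title in one namespace is not the same as title in another namespace.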
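To make that last point concrete, here is a small sketch with two title elements in different namespaces. The document and the second namespace URI (http://example.com/other) are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two <title> elements that differ only in namespace.
DOC = """<root xmlns:a="http://www.mediawiki.org/xml/export-0.10/"
      xmlns:b="http://example.com/other">
  <a:title>wiki title</a:title>
  <b:title>other title</b:title>
</root>"""

root = ET.fromstring(DOC)

# Each tag carries its full namespace, so the two names are distinct.
tags = [child.tag for child in root]
print(tags)

# A bare 'title' matches neither element, because neither is un-namespaced.
print(root.find('title'))

# The fully qualified name selects exactly one of them.
wiki = root.find('{http://www.mediawiki.org/xml/export-0.10/}title')
print(wiki.text)
```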

Source

2016-04-05 01:10:27 larsks
