Handling multiple nodes when parsing XML with Python

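For an assignment, I need to parse a 2-million-line XML file and load the data into a MySQL database. Since we are using a Python environment with sqlite in class, I am trying to parse the file with Python. Keep in mind that I am just learning Python, so everything is new!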

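I have made several attempts, but I keep failing and getting more frustrated. To save time, I am testing my code on just a small piece of the full XML, shown here: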

<pub> 
<ID>7</ID> 
<title>On the Correlation of Image Size to System Accuracy in Automatic Fingerprint Identification Systems</title> 
<year>2003</year> 
<booktitle>AVBPA</booktitle> 
<pages>895-902</pages> 
<authors> 
    <author>J. K. Schneider</author> 
    <author>C. E. Richardson</author> 
    <author>F. W. Kiefer</author> 
    <author>Venu Govindaraju</author> 
</authors> 
</pub>

First attempt

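Here I successfully pull the data out of every tag, except when there are multiple authors under the <authors> tag. I am trying to loop over each node inside the authors tag, count them, build a temporary array of those authors, and then insert them into my database with SQL. I get 15 for the number of authors, but there are clearly only 4! How do I fix this?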

from xml.dom import minidom 

xmldoc= minidom.parse("test.xml") 

pub = xmldoc.getElementsByTagName("pub")[0] 
ID = pub.getElementsByTagName("ID")[0].firstChild.data 
title = pub.getElementsByTagName("title")[0].firstChild.data 
year = pub.getElementsByTagName("year")[0].firstChild.data 
booktitle = pub.getElementsByTagName("booktitle")[0].firstChild.data 
pages = pub.getElementsByTagName("pages")[0].firstChild.data 
authors = pub.getElementsByTagName("authors")[0] 
author = authors.getElementsByTagName("author")[0].firstChild.data 
num_authors = len(author) 
print("Number of authors: ", num_authors) 

print(ID) 
print(title) 
print(year) 
print(booktitle) 
print(pages) 
print(author)

Source

2017-04-23 douglasrcjames

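Note that what you are getting here is the number of characters in the first author's name: the code restricts the result to the first author only (index 0) and then takes the length of that string. "J. K. Schneider" is 15 characters long, which is where the 15 comes from: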

author = authors.getElementsByTagName("author")[0].firstChild.data 
num_authors = len(author) 
print("Number of authors: ", num_authors)

Simply drop the index so that you get all of the authors:

author = authors.getElementsByTagName("author") 
num_authors = len(author) 
print("Number of authors: ", num_authors)

You can use a list comprehension to get a list of all the author names rather than the author elements:

author = [a.firstChild.data for a in authors.getElementsByTagName("author")] 
print(author) 
# [u'J. K. Schneider', u'C. E. Richardson', u'F. W. Kiefer', u'Venu Govindaraju']
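
Since the goal is to get the authors into a database, here is a minimal sketch of how that list could feed SQL inserts. It is not from the original post: it uses the standard-library sqlite3 module with an in-memory database instead of MySQL, and the table names `pub` and `author`, the wrapping `<pubs>` root element, and the helper `text_of` are all assumptions made for illustration.

```python
import sqlite3
from xml.dom import minidom

# Sample record from the question, wrapped in a hypothetical <pubs> root.
SAMPLE = """<pubs>
<pub>
<ID>7</ID>
<title>On the Correlation of Image Size to System Accuracy in Automatic Fingerprint Identification Systems</title>
<year>2003</year>
<authors>
    <author>J. K. Schneider</author>
    <author>C. E. Richardson</author>
    <author>F. W. Kiefer</author>
    <author>Venu Govindaraju</author>
</authors>
</pub>
</pubs>"""


def text_of(node, tag):
    """Return the text of the first <tag> element below node (hypothetical helper)."""
    return node.getElementsByTagName(tag)[0].firstChild.data


doc = minidom.parseString(SAMPLE)

# Assumed schema: one row per publication, one row per (publication, author).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pub (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
conn.execute("CREATE TABLE author (pub_id INTEGER, name TEXT)")

for pub in doc.getElementsByTagName("pub"):
    pub_id = int(text_of(pub, "ID"))
    conn.execute(
        "INSERT INTO pub VALUES (?, ?, ?)",
        (pub_id, text_of(pub, "title"), int(text_of(pub, "year"))),
    )
    # All author names for this publication, not just index 0.
    names = [a.firstChild.data for a in pub.getElementsByTagName("author")]
    conn.executemany(
        "INSERT INTO author VALUES (?, ?)", [(pub_id, n) for n in names]
    )
conn.commit()

rows = conn.execute("SELECT name FROM author WHERE pub_id = 7").fetchall()
print("Number of authors:", len(rows))
```

Keeping authors in their own table avoids guessing a maximum number of author columns per publication. The `?` placeholders let the driver handle quoting of names and titles.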

Source

2017-04-23 06:27:02 har07

I knew I needed to access each item in the array, but I wasn't sure of the syntax. Thanks a lot! – douglasrcjames

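Hey @har07, I have made progress, but some of my XML data is "bad" in a sense... I have a name containing the special character "í", which appears as "&iacute;" in the XML file. How do I handle these special language characters in Python? The error I get is "ExpatError: undefined entity". – douglasrcjames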
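
One possible way to deal with the undefined-entity error, offered as a sketch rather than as the answer given in the thread: XML predefines only five named entities (`&amp;`, `&lt;`, `&gt;`, `&quot;`, `&apos;`), so HTML names such as `&iacute;` are rejected unless a DTD declares them. The code below rewrites known HTML entity names into numeric character references before parsing. The function name `replace_html_entities` and the sample string are made up for illustration.

```python
import re
from html.entities import name2codepoint
from xml.dom import minidom

# The five entities every XML parser already understands; leave them alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def replace_html_entities(text):
    """Turn HTML named entities (e.g. &iacute;) into numeric references (&#237;)."""

    def to_numeric(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep XML entities and unknown names as-is
        return "&#{};".format(name2codepoint[name])

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", to_numeric, text)


# Hypothetical input containing two HTML entities and one XML entity.
raw = "<authors><author>Jos&eacute; Garc&iacute;a &amp; Co.</author></authors>"

doc = minidom.parseString(replace_html_entities(raw))
name = doc.getElementsByTagName("author")[0].firstChild.data
print(name)
```

This sketch processes the document as a single string, so for a very large file it may be preferable to apply the same substitution line by line while writing a cleaned copy to disk.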
