Trying to parse a large XML file in Python - memory error

I'm a beginner "scraper" without a whole lot of programming experience.

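I'm using Python in the Canopy environment to extract data from some downloaded XML files, and I'm using the xml.dom parser to do it. For now I'm only trying to pull tags out of the first us-bibliographic-data-grant element (which is why I use [0]), just to see how I want to parse and store the whole data set, rather than doing it all at once. An excerpt from the XML follows: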

<?xml version="1.0" encoding="UTF-8"?> 
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]> 
<us-patent-grant lang="EN" dtd-version="v4.2 2006-08-23" file="USD0606726-20091229.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20091214" date-publ="20091229"> 
<us-bibliographic-data-grant> 
<publication-reference> 
<document-id> 
<country>US</country> 
<doc-number>D0606726</doc-number> 
<kind>S1</kind> 
<date>20091229</date> 
</document-id> 
</publication-reference> 
<application-reference appl-type="design"> 
<document-id> 
<country>US</country> 
<doc-number>29299001</doc-number> 
<date>20071217</date>

So far my code looks like this:

from xml.dom import minidom 

filename = "C:/Users/SMOLENSK/Documents/Inventor Research/xml_2009/ipg091229.xml" 

f = open(filename, 'r') 

doc = f.read() 

f.close() 

xmldata = '<root>' + doc + '</root>' 

data = minidom.parse(xmldata) 

US_Biblio = xmldata.getElementsByTagName("us-bibliographic-data-grant")[0] 

pat_num = US_Biblio.getElementsByTagName("doc-number")[0] 

dates = pat_num.getElementsByTagName("date") 

for date in dates: 
    print(date)

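After a few memory error messages I did get the code to run all the way through, but only once, and unfortunately I wasn't able to note down what happened. Because of the amount of data (this file alone is 4.6 million lines), the operation now crashes every time and I can't reproduce that run.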

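Can anyone see errors in the code? As written, it parses the whole data set before it starts storing each tag name. Is there a way to parse only a certain amount? Maybe just create a new XML file containing the first set.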

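In case you are wondering, I added the '<root>' wrapper to get around the

ExpatError: junk after line xxx

problem I was getting before. I know my programming skills aren't amazing, so hopefully I haven't made a simple, embarrassing programming error.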

Source

2017-07-28 HelloToEarth

You are adding the '<root>' tags to a copy of the entire file. 'minidom.parse' will take a 'file' object. Try using 'with' and 'data = minidom.parse(f)' –

Hey, Mike. Sorry, while I do understand what you mean about my 'xmldata', I'm not sure how to use 'with' for this. Could you clarify with an example by any chance? – HelloToEarth

... [Using Python Iterparse For Large XML Files](https://stackoverflow.com/q/7171140/2823755) ... maybe try lxml. Also, minidom has an [unlink](https://docs.python.org/3/library/xml.dom.minidom.html#xml.dom.minidom.Node.unlink) method that can help free things that are no longer needed. Whenever you narrow your search and make a new assignment, try deleting the previous variable (e.g. 'del data') – wwii
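For reference, a minimal sketch of the iterparse idea from the comment above, using the standard library's xml.etree.ElementTree instead of lxml. It assumes a single well-formed document; the file in the question seems to be several documents concatenated, so each one would have to be handed over separately. The function name iter_doc_numbers and the sample bytes are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_doc_numbers(source, tag="doc-number"):
    """Yield the text of every <tag> element without building the full tree.

    `source` is a filename or a binary file object (hypothetical helper).
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem.text
        # Drop the element's children and text once it has been handled,
        # so already-processed content does not pile up in memory.
        elem.clear()


# Tiny sample shaped like the excerpt in the question.
sample = (
    b"<us-patent-grant><us-bibliographic-data-grant>"
    b"<publication-reference><document-id>"
    b"<country>US</country><doc-number>D0606726</doc-number>"
    b"<date>20091229</date>"
    b"</document-id></publication-reference>"
    b"</us-bibliographic-data-grant></us-patent-grant>"
)

numbers = list(iter_doc_numbers(io.BytesIO(sample)))
print(numbers)
```

Because the function is a generator, wrapping it in next(...) rather than list(...) would stop after the first match instead of reading the rest of the input.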

Try:

with open(filename, 'r') as f: 
    data = minidom.parse(f)

If you really need the '<root>' tags, you may have to tweak it a bit, maybe:

data = minidom.parse(itertools.chain('<root>', f, '</root>'))

Source

2017-07-28 02:53:12

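When I use 'itertools.chain' outside the 'with' statement I get the same _ExpatError: junk after line xxx..._, and inside the 'with' statement I get _AttributeError: 'itertools.chain' object has no attribute 'read'_. I assume the first is again due to the data itself repeating XML root elements, but what is the second one due to? – HelloToEarth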

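That attribute error occurs because parse needs a 'file' object (something with a read method). The chain we gave it is an iterator that returns strings, which is evidently not what parse wants. Is the XML well-formed? If not, you could try the 'BeautifulSoup' package to parse it. –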

Take a look at this [question](https://stackoverflow.com/questions/45395811/parsing-xml-with-beautiful-soup). It is a duplicate of your question. –
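One way to avoid both the '<root>' wrapper and the full read into memory, sketched under an assumption: the "junk after" error suggests the download is many complete XML documents concatenated, each starting with its own `<?xml ...?>` declaration on a new line. If that holds, the file can be read line by line and each document parsed separately. The helper name iter_documents is hypothetical, and the second doc-number in the sample is invented for the demo.

```python
import io
from xml.dom import minidom


def iter_documents(lines):
    """Yield each concatenated XML document as one string.

    Assumes every document begins with an '<?xml' declaration at the
    start of a line (an assumption about the file, not a verified fact).
    """
    chunk = []
    for line in lines:
        if line.lstrip().startswith("<?xml") and chunk:
            yield "".join(chunk)
            chunk = []
        chunk.append(line)
    if chunk:
        yield "".join(chunk)


# Two tiny documents glued together, standing in for the real file.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<us-patent-grant><doc-number>D0606726</doc-number></us-patent-grant>\n"
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<us-patent-grant><doc-number>D0606727</doc-number></us-patent-grant>\n"
)

doc_numbers = []
for text in iter_documents(io.StringIO(sample)):
    dom = minidom.parseString(text)  # parseString takes text, parse takes a file
    node = dom.getElementsByTagName("doc-number")[0]
    doc_numbers.append(node.firstChild.data)
    dom.unlink()  # release this document before parsing the next one

print(doc_numbers)
```

With the real file, `open(filename, 'r')` inside a `with` block would take the place of `io.StringIO(sample)`, and `next(iter_documents(f))` would return just the first grant for experimenting.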
