2014-09-13 41 views

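Parsing a docx with the ElementTree module

I have this document that I need to parse to get its XML equivalent. Basically I need an ElementTree-type object, but I can't get one. I have tried many different combinations and still haven't figured it out. Here is what I did: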

import zipfile as zf
import xml.etree.ElementTree as ET

z = zf.ZipFile("INTRODUCTION.docx")    # a .docx file is a zip archive
doc_xml = z.read("word/document.xml")  # raw XML of the main document part
print doc_xml   # type(doc_xml) is str

<?xml version="1.0" encoding="UTF-8" standalone="yes"?> 
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 wp14"><w:body><w:p w:rsidR="00470EEF" w:rsidRDefault="00456755"><w:pPr><w:rPr><w:b/></w:rPr></w:pPr><w:r w:rsidRPr="00456755"><w:rPr><w:b/></w:rPr><w:t>INTRODUCTION</w:t></w:r></w:p><w:p w:rsidR="00456755" w:rsidRDefault="00456755"><w:r w:rsidRPr="00456755"><w:t>This is a test document for xml</w:t></w:r><w:r><w:t>.</w:t></w:r></w:p><w:p w:rsidR="00456755" w:rsidRDefault="00456755"><w:proofErr w:type="spellStart"/><w:proofErr w:type="gramStart"/><w:r><w:t>Lets</w:t></w:r><w:proofErr w:type="spellEnd"/><w:proofErr w:type="gramEnd"/><w:r><w:t xml:space="preserve"> see how this works.</w:t></w:r></w:p><w:p w:rsidR="00456755" w:rsidRDefault="00456755"/><w:p w:rsidR="00456755" w:rsidRDefault="00456755"/><w:p w:rsidR="00456755" w:rsidRDefault="00456755"><w:pPr><w:rPr><w:b/></w:rPr></w:pPr><w:r 
w:rsidRPr="00456755"><w:rPr><w:b/></w:rPr><w:t>Conclusion</w:t></w:r></w:p><w:p w:rsidR="00456755" w:rsidRPr="00456755" w:rsidRDefault="00456755"><w:r w:rsidRPr="00456755"><w:t>It should hopefully</w:t></w:r><w:r><w:t>..</w:t></w:r><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/></w:p><w:sectPr w:rsidR="00456755" w:rsidRPr="00456755"><w:pgSz w:w="11906" w:h="16838"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/><w:cols w:space="708"/><w:docGrid w:linePitch="360"/></w:sectPr></w:body></w:document> 

Since doc_xml is a string, I used the following to get an Element:

rooted = ET.fromstring(doc_xml) #type(rooted) is 'Element' 
type(rooted) 

And this as well:

tree = ET.ElementTree(doc_xml) #type(tree) is 'ElementTree' 
type(tree) 

I thought this worked, but when I do:

for branch in tree.iter(): 
    print branch 
--------------------------------------------------------------------------- 
AttributeError       Traceback (most recent call last) 
<ipython-input-83-d503315fb5e6> in <module>() 
----> 1 for branch in tree.iter(): 
     2  print branch 

C:\Anaconda\lib\xml\etree\ElementTree.pyc in iter(self, tag) 
    671  def iter(self, tag=None): 
    672   # assert self._root is not None 
--> 673   return self._root.iter(tag) 
    674 
    675  # compatibility 

AttributeError: 'str' object has no attribute 'iter' 

The variable tree is of type ElementTree. How can I fix this?

Put a 'print type(tree)' after tree and make sure it is not a string – gosom 2014-09-13 10:01:57

Yes, it shows the type ElementTree – 2014-09-13 10:03:45

Can you write a standalone script and paste the full traceback? – 2014-09-13 10:09:49

Answers

With this line,

rooted = ET.fromstring(doc_xml) 

you get an Element instance by parsing the XML document supplied as a string. You can iterate over this instance:

for branch in rooted.iter(): 
    print branch 

When you do this,

tree = ET.ElementTree(doc_xml) 

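you create an ElementTree instance with a string as its argument. The constructor expects the root Element there, so the string is simply stored as the root without being parsed. This does not produce an error message, but trying to iterate over the tree fails because it is not a "real" tree: the traceback shows iter() being called on the stored string.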
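
The difference can be reproduced without a docx file. Below is a minimal sketch, assuming a small made-up XML string (`SAMPLE_XML` and `try_iterate` are hypothetical names, not from the question). It catches both AttributeError and TypeError because, as far as I know, some newer Python versions reject a non-Element root directly in the constructor instead of failing later in iter().

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the contents of word/document.xml.
SAMPLE_XML = "<doc><p>INTRODUCTION</p><p>Conclusion</p></doc>"


def try_iterate(root):
    """Wrap `root` in an ElementTree and list its tags.

    Returns the list of tag names on success, or the name of the
    exception raised when `root` is not a parsed Element.
    """
    try:
        tree = ET.ElementTree(root)
        return [elem.tag for elem in tree.iter()]
    except (AttributeError, TypeError) as exc:
        return type(exc).__name__


# A raw string is never parsed, so iterating over the tree fails.
print(try_iterate(SAMPLE_XML))

# A parsed Element is a valid root, so the same call works.
print(try_iterate(ET.fromstring(SAMPLE_XML)))
```

So if you already have the string, `ET.ElementTree(ET.fromstring(doc_xml))` should give you a usable ElementTree.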


If you need an ElementTree instance, I suggest doing it like this:

import xml.etree.ElementTree as ET 
import zipfile as zf 

z = zf.ZipFile("INTRODUCTION.docx") 
f = z.open("word/document.xml") # a file-like object 
tree = ET.parse(f)    # an ElementTree instance 

for elem in tree.iter(): 
    print elem 
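
To try this approach without a real .docx file, here is a minimal sketch that builds a tiny archive in memory and parses it the same way. Assumptions: `SAMPLE_XML`, `make_docx_bytes` and `paragraph_texts` are made-up names for illustration, the sample XML is a heavily trimmed stand-in for the document in the question, and `print()` is used so that it runs on Python 3. ElementTree reports tags in `{namespace-uri}name` form, so the lookups spell out the WordprocessingML namespace.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace bound to the "w" prefix in the question's document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical, trimmed-down document.xml for illustration only.
SAMPLE_XML = (
    '<w:document xmlns:w="%s"><w:body>'
    '<w:p><w:r><w:t>INTRODUCTION</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>This is a test document for xml</w:t></w:r>'
    '<w:r><w:t>.</w:t></w:r></w:p>'
    '</w:body></w:document>' % W_NS
)


def make_docx_bytes(document_xml):
    """Build an in-memory zip that holds only word/document.xml."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("word/document.xml", document_xml)
    return buf.getvalue()


def paragraph_texts(docx_file):
    """Parse word/document.xml and return the text of each paragraph."""
    with zipfile.ZipFile(docx_file) as z:
        with z.open("word/document.xml") as f:  # a file-like object
            tree = ET.parse(f)                  # an ElementTree instance
    texts = []
    for para in tree.iter("{%s}p" % W_NS):
        # A paragraph's text is spread over the w:t elements of its runs.
        parts = [t.text or "" for t in para.iter("{%s}t" % W_NS)]
        texts.append("".join(parts))
    return texts


print(paragraph_texts(io.BytesIO(make_docx_bytes(SAMPLE_XML))))
```

`paragraph_texts` also accepts a filename such as "INTRODUCTION.docx", since `zipfile.ZipFile` takes either a path or a file-like object.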

Thanks, that worked. Can the ElementTree module help return the number of words in a specific color in a docx? – 2014-09-14 08:42:19

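You can use ElementTree to extract any information from the XML document, but if you need specific functionality for counting words, you will have to create it yourself. – mzjn 2014-09-14 08:53:16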
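
As a rough illustration of what "create it yourself" might look like, here is a sketch that counts words in runs carrying a directly applied color. It assumes the color is stored as `<w:color w:val="RRGGBB"/>` inside the run's `w:rPr`; colors inherited from styles or themes are not handled. Words split across several runs are counted per run, so the total can be off for such text. `SAMPLE_XML` and `count_words_by_color` are hypothetical names, and the sample XML is invented for the example.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}  # prefix map for find()/findall()

# Invented sample: two red runs, one uncolored run, one blue run.
SAMPLE_XML = (
    '<w:document xmlns:w="%s"><w:body><w:p>'
    '<w:r><w:rPr><w:color w:val="FF0000"/></w:rPr>'
    '<w:t>red words here</w:t></w:r>'
    '<w:r><w:t> plain text </w:t></w:r>'
    '<w:r><w:rPr><w:color w:val="0000FF"/></w:rPr><w:t>blue</w:t></w:r>'
    '<w:r><w:rPr><w:color w:val="FF0000"/></w:rPr><w:t> again</w:t></w:r>'
    '</w:p></w:body></w:document>' % W_NS
)


def count_words_by_color(root, color):
    """Count whitespace-separated words in runs whose w:color matches."""
    total = 0
    for run in root.findall(".//w:r", NS):
        color_elem = run.find("w:rPr/w:color", NS)
        if color_elem is None:
            continue  # no direct color on this run
        # Attributes are namespace-qualified as well.
        value = color_elem.get("{%s}val" % W_NS, "")
        if value.upper() != color.upper():
            continue
        for t in run.findall("w:t", NS):
            total += len((t.text or "").split())
    return total


root = ET.fromstring(SAMPLE_XML)
print(count_words_by_color(root, "FF0000"))
```

With a real file, `root` would be `tree.getroot()` from the ElementTree obtained via `ET.parse` as shown in the answer.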