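Python ElementTree XML parsing
I want to parse an XML file that I produced by exporting a PDF to XML 1.0 with Adobe Acrobat Pro. I'm using Python and ElementTree for the parsing. The PDF contains a table that spans several pages and has several different table headers. I want to parse and extract the row and column data of the table that starts at a header containing a specific string (e.g. "MECHANICAL") and stops at the next table header section (e.g. "COMPLETED"), excluding all row and column data before and after that section. There is no simple tag to key on; the tag pattern just repeats.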
Here is my current Python code:
# Python
import sys
import re  # regular expressions
import xml.etree.ElementTree as xml

tree = xml.parse("C:/Documents and Settings/alilly.CORPORATE/Desktop/python xml parse/excerpt.xml")

print("=================== Find Columns ====================")
for node in tree.iter('TR'):
    print("tag=", node.tag)
    # number of TD cells in this row (only used by the filter below)
    count = len(list(node.iter('TD')))
    #if count != 10:
    #    continue
    print("------------")
    for col in node.iter('TD'):
        print("  tag=", col.tag, "attrib=", col.attrib, "text=", col.text)

print("=================== Find Headers ====================")
# find headers: the header text is the tail of each ImageData element
for node in tree.iter('ImageData'):
    print("figure text = ", node.tail)
Here is my XML file:
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Created from PDF via Acrobat SaveAsXML -->
<!-- Mapping Table version: 28-February-2003 -->
<TaggedPDF-doc>
<?xpacket begin='?' id='W5M0MpCehiHzreSzNTczkc9d'?>
<?xpacket begin="?" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c041 52.342996, 2008/05/07-20:48:00 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
<pdf:Producer>GPL Ghostscript 8.70</pdf:Producer>
<pdf:Keywords/>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:xmp="http://ns.adobe.com/xap/1.0/">
<xmp:ModifyDate>2011-03-01T09:36:13-05:00</xmp:ModifyDate>
<xmp:CreateDate>2011-03-01T09:36:13-05:00</xmp:CreateDate>
<xmp:CreatorTool>PDFCreator Version 1.0.2</xmp:CreatorTool>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
<xmpMM:DocumentID>d417764e-466c-11e0-0000-f7ea6a538d79</xmpMM:DocumentID>
<xmpMM:InstanceID>uuid:0c6ada50-6db0-4d59-88e1-fc23aa6ebc14</xmpMM:InstanceID>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:format>xml</dc:format>
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">my pdf file</rdf:li>
</rdf:Alt>
</dc:title>
<dc:creator>
<rdf:Seq>
<rdf:li>ltamm</rdf:li>
</rdf:Seq>
</dc:creator>
<dc:description>
<rdf:Alt>
<rdf:li xml:lang="x-default"/>
<rdf:li xml:lang="x-repair"/>
</rdf:Alt>
</dc:description>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
<?xpacket end='r'?>
<Part>
<H1>Misc </H1>
<Sect>
<H3>This is a test </H3>
<Sect>
<H5>Deletions </H5>
<L>
<LI>
<LI_Title>Special codes </LI_Title>
</LI>
</L>
<Figure>
<ImageData src=""/>
</Figure>
<Figure>
<ImageData src=""/>
Main INTERIOR </Figure>
<Table>
<TR>
<TH>S = Standard O = Optional </TH>
</TR>
<TR>
<TD><Figure>
<ImageData src=""/>
</Figure>
</TD>
<TD>S </TD>
</TR>
</Table>
<Figure>
<ImageData src=""/>
This is the MECHANICAL header</Figure>
<Table>
<TR>
<TH>S = Standard O = Optional </TH>
</TR>
<TR>
<TH>Free Flow </TH>
<TD>Ref. Code </TD>
<TD>DESCRIPTION </TD>
<TD>Rooster </TD>
<TD>747 Dog </TD>
<TD>888 Rabbit </TD>
</TR>
<TR>
<TD>xxx GOgo xxB </TD>
<TD>Beany xxx </TD>
<TD>nothing here xxx </TD>
<TD>xxx B </TD>
<TD>snake ddd </TD>
<TD>Cow fff </TD>
<TD>eee </TD>
</TR>
<TR>
<TH/>
<TD/>
<TD>Squirrel Protection </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
</TR>
<TR>
<TH/>
<TD>J77 </TD>
<TD>Rocket Launcher </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
</TR>
<TR>
<TH/>
<TD/>
<TD>Lunch </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
<TD>S </TD>
</TR>
<TR>
<TH/>
<TD>Jss5 </TD>
<TD>Now is the time for all good men </TD>
<TD>-</TD>
<TD>A1 </TD>
<TD>A1 </TD>
<TD>-</TD>
<TD>-</TD>
<TD>-</TD>
<TD>-</TD>
</TR>
<TR>
<TD>Capacity </TD>
<TD/>
<TD>2/3 </TD>
<TD>2/3 </TD>
<TD>2/3 </TD>
</TR>
</Table>
<Figure>
<ImageData src=""/>
Final COMPLETED PAGE 1 OF 2 </Figure>
<Figure>
<ImageData src=""/>
</Figure>
<P>Graphite </P>
<P>painted fun </P>
<P>Control yourself </P>
<Figure>
<ImageData src=""/>
Meaningless Header PAGE 2 OF 2 </Figure>
<Figure>
<ImageData src=""/>
</Figure>
<P>)multi-coat </P>
<P>front</P>
<P>single-slot system </P>
<Figure>
<ImageData src=""/>
Almost Done Header PAGE 1 OF 1 </Figure>
<Figure>
<ImageData src=""/>
</Figure>
<Figure>
<ImageData src=""/>
</Figure>
<Figure>
<ImageData src=""/>
</Figure>
<P>Snow Blizzard. </P>
<P>Done </P>
</Sect>
</Sect>
</Part>
</TaggedPDF-doc>
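"import xml.etree.ElementTree as xml" is a bad idea; you have just clobbered the standard xml package namespace. Better to import it as "ET", or as something that does not conflict with a known package or module name. – 2011-06-15 19:40:12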
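One possible way to get only the rows between two headers, sketched here under a few assumptions: the header text always sits in the tail of an ImageData element (as it does in the XML above), and the rows of interest are the TR elements that follow it in document order. The function name extract_section and the SAMPLE string are made up for illustration; SAMPLE is a trimmed stand-in for the export, with an extra table after the "COMPLETED" header added only to show that it gets skipped. The import uses the "ET" alias suggested in the comment.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the Acrobat export shown in the question.
# The table after the COMPLETED header is hypothetical, added for the demo.
SAMPLE = """<TaggedPDF-doc>
<Part>
<Figure><ImageData src=""/>
Main INTERIOR </Figure>
<Table>
<TR><TD>S </TD></TR>
</Table>
<Figure><ImageData src=""/>
This is the MECHANICAL header</Figure>
<Table>
<TR><TH>Free Flow </TH><TD>Ref. Code </TD><TD>DESCRIPTION </TD></TR>
<TR><TH/><TD>J77 </TD><TD>Rocket Launcher </TD></TR>
</Table>
<Figure><ImageData src=""/>
Final COMPLETED PAGE 1 OF 2 </Figure>
<Table>
<TR><TD>ignored </TD></TR>
</Table>
</Part>
</TaggedPDF-doc>"""


def extract_section(root, start_marker, stop_marker):
    """Return rows (lists of cell text) found between two header markers."""
    rows = []
    collecting = False
    # Element.iter() walks the whole tree in document order.
    for elem in root.iter():
        if elem.tag == "ImageData":
            # The header text follows the empty ImageData tag, so it is its tail.
            header = (elem.tail or "").strip()
            if collecting and stop_marker in header:
                break
            if start_marker in header:
                collecting = True
        elif elem.tag == "TR" and collecting:
            # Direct children of TR are the TH/TD cells; empty cells become "".
            cells = [(cell.text or "").strip()
                     for cell in elem if cell.tag in ("TH", "TD")]
            rows.append(cells)
    return rows


root = ET.fromstring(SAMPLE)
rows = extract_section(root, "MECHANICAL", "COMPLETED")
for row in rows:
    print(row)
```

For the real file, ET.parse(path).getroot() would replace ET.fromstring(SAMPLE). Matching on an explicit stop string, rather than on "any non-empty header", keeps the ImageData elements nested inside table cells from ending the section early, since their tails contain only whitespace.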