Improving a text extraction routine from XML

2010-09-29

I have an XML file in which the text is enclosed within <TEXT> </TEXT> tags:

<TEXT> 

<!-- PJG STAG 4703 --> 

<!-- PJG ITAG l=94 g=1 f=1 --> 

<!-- PJG /ITAG --> 

<!-- PJG ITAG l=69 g=1 f=1 --> 

<!-- PJG /ITAG --> 

<!-- PJG ITAG l=50 g=1 f=1 --> 


<USDEPT>DEPARTMENT OF AGRICULTURE</USDEPT> 

<!-- PJG /ITAG --> 

<!-- PJG ITAG l=18 g=1 f=1 --> 

<USBUREAU>Packers and Stockyards Administration</USBUREAU> 
<!-- PJG 0012 frnewline --> 

<!-- PJG /ITAG --> 

<!-- PJG ITAG l=55 g=1 f=1 --> 
Amendment to Certification of Central Filing System_Oklahoma 
<!-- PJG 0012 frnewline --> 

<!-- PJG 0012 frnewline --> 

<!-- PJG /ITAG --> 

<!-- PJG ITAG l=11 g=1 f=1 --> 
The Statewide central filing system of Oklahoma has been previously certified, pursuant to section 1324 of the Food 
Security Act of 1985, on the basis of information submitted by Hannah D. Atkins, Secretary of State, for farm products 
produced in that State (52 FR 49056, December 29, 1987). 
<!-- PJG 0012 frnewline --> 

<!-- PJG 0012 frnewline --> 
The certification is hereby amended on the basis of information submitted by John Kennedy, Secretary of State, for 
additional farm products produced in that State as follows: Cattle semen, cattle embryos, milo. 
<!-- PJG 0012 frnewline --> 

<!-- PJG 0012 frnewline --> 
This is issued pursuant to authority delegated by the Secretary of Agriculture. 
<!-- PJG /ITAG --> 

<!-- PJG QTAG 04 --> 
<!-- PJG /QTAG --> 

<!-- PJG 0012 frnewline --> 

<!-- PJG 0012 frnewline --> 

<!-- PJG ITAG l=21 g=1 f=1 --> 

<!-- PJG /ITAG --> 

<!-- PJG ITAG l=21 g=1 f=4 --> 
Authority: 
<!-- PJG /ITAG --> 

<!-- PJG ITAG l=21 g=1 f=1 --> 
Sec. 1324(c)(2), Pub. L. 99-198, 99 Stat. 1535, 7 U.S.C. 1631(c)(2); 7 CFR 2.18(e)(3), 2.56(a)(3), 55 FR 22795. 
<!-- PJG /ITAG --> 

<!-- PJG QTAG 02 --> 
<!-- PJG /QTAG --> 

<!-- PJG 0012 frnewline --> 

<!-- PJG 0012 frnewline --> 

<!-- PJG ITAG l=21 g=1 f=1 --> 
Dated: January 21, 1994 
<!-- PJG 0012 frnewline --> 

<!-- PJG 0012 frnewline --> 

<!-- PJG 0012 frnewline --> 

<!-- PJG /ITAG --> 

<SIGNER> 
<!-- PJG ITAG l=06 g=1 f=1 --> 
Calvin W. Watkins, Acting Administrator, 
<!-- PJG 0012 frnewline --> 

<!-- PJG /ITAG --> 
</SIGNER> 
<SIGNJOB> 
<!-- PJG ITAG l=04 g=1 f=1 --> 
Packers and Stockyards Administration. 
<!-- PJG 0012 frnewline --> 

<!-- PJG 0012 frnewline --> 

<!-- PJG /ITAG --> 
</SIGNJOB> 
<FRFILING> 
<!-- PJG ITAG l=40 g=1 f=1 --> 
[FR Doc. 94-1847 Filed 1-27-94; 8:45 am] 
<!-- PJG 0012 frnewline --> 

<!-- PJG /ITAG --> 
</FRFILING> 
<BILLING> 
<!-- PJG ITAG l=68 g=1 f=1 --> 
BILLING CODE 3410-KD-P 
<!-- PJG /ITAG --> 
</BILLING> 

<!-- PJG 0012 frnewline --> 

<!-- PJG 0012 frnewline --> 

<!-- PJG /STAG --> 
</TEXT> 

My task is to extract the text from each of these TEXT nodes. This is what I am doing:

def getTextFromXML():
    global Text, xmlDoc
    TextNodes = xmlDoc.getElementsByTagName("TEXT")
    docstr = ''
    #Text = [TextFromNode(textNode) for textNode in TextNodes]
    for textNode in TextNodes:
        for cNode in textNode.childNodes:
            if cNode.nodeType == Node.TEXT_NODE:
                docstr += cNode.data
            else:
                for ccNode in cNode.childNodes:
                    if ccNode.nodeType == Node.TEXT_NODE:
                        docstr += ccNode.data
        Text.append(docstr)

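The problem is that it takes a lot of time. I suspect my function is not efficient. Can anyone suggest how it could be improved?

Edit: The file I am processing contains 6000+ <TEXT> elements.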

Answers

1

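lxml is much easier to use than the XML libraries included in the Python standard library. It is a binding for the C library libxml2, so I assume it is also faster.

I would do something like this (using your variable names):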

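from lxml import etree

# Binary mode lets the parser honor the encoding declared in the file.
with open('some-file.xml', 'rb') as f:
    xmlDoc = etree.parse(f)
root = xmlDoc.getroot()

Text = []
# 'TEXT' only matches direct children of the root; use '//TEXT' if they are nested deeper.
for textNode in root.xpath('TEXT'):
    # Text directly inside <TEXT> plus the text of its immediate child elements.
    parts = (text.strip() for text in textNode.xpath('*/text() | text()'))
    docstr = '\n'.join(part for part in parts if part)
    Text.append(docstr)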
0

If you are using lxml (or xml.etree in Python 2.7), you can use the .itertext() method of an element, for example:

s = ''.join(elem.itertext()) 
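
To make the .itertext() suggestion concrete, here is a minimal sketch using only the standard library's xml.etree.ElementTree. The sample string, the DOC root element, and the helper name texts_via_itertext are assumptions made for illustration; they are not taken from the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, much shorter than the real documents.
# The <DOC> root element is an assumption; the question does not show the root.
SAMPLE = """<DOC>
<TEXT>
<!-- PJG ITAG l=50 g=1 f=1 -->
<USDEPT>DEPARTMENT OF AGRICULTURE</USDEPT>
<!-- PJG /ITAG -->
Dated: January 21, 1994
</TEXT>
<TEXT>
<BILLING>BILLING CODE 3410-KD-P</BILLING>
</TEXT>
</DOC>"""


def texts_via_itertext(root):
    """Return one whitespace-normalized string per TEXT element."""
    result = []
    for elem in root.iter("TEXT"):
        # itertext() yields the text of the element and all of its descendants.
        raw = "".join(elem.itertext())
        # Collapse runs of whitespace and newlines into single spaces.
        result.append(" ".join(raw.split()))
    return result


root = ET.fromstring(SAMPLE)
texts = texts_via_itertext(root)
for t in texts:
    print(t)
```

The default ElementTree parser discards comments, so the PJG markers never show up in the output. Note that itertext() descends through every nesting level, whereas the loop in the question only looks two levels deep.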

With lxml, you can most likely also use the string() XPath function (which may be faster, because all the work is done by libxml2 itself rather than in Python):

s = elem.xpath('string()')
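
Separately from the choice of library, the function in the question never resets docstr inside the outer loop, so each entry appended to Text also carries the text of every earlier TEXT element, and repeated += on an ever-growing string may add further cost. The following is a minimal sketch that stays with xml.dom.minidom; the function name extract_texts and the sample string are hypothetical, and it assumes the same two-level traversal as the original.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# Hypothetical sample; the <DOC> root element is an assumption.
SAMPLE = (
    "<DOC>"
    "<TEXT>Authority: <!-- PJG /ITAG --><SIGNER>Calvin W. Watkins</SIGNER> end</TEXT>"
    "<TEXT>BILLING CODE 3410-KD-P</TEXT>"
    "</DOC>"
)


def extract_texts(xml_doc):
    """Collect the text of each TEXT element, two levels deep, one string per element."""
    texts = []
    for text_node in xml_doc.getElementsByTagName("TEXT"):
        parts = []  # fresh list for every TEXT element, so nothing accumulates
        for child in text_node.childNodes:
            if child.nodeType == Node.TEXT_NODE:
                parts.append(child.data)
            elif child.nodeType == Node.ELEMENT_NODE:
                # Comments have no children, so only elements need a second pass.
                for grandchild in child.childNodes:
                    if grandchild.nodeType == Node.TEXT_NODE:
                        parts.append(grandchild.data)
        # Join once per element instead of growing a string with +=.
        texts.append("".join(parts))
    return texts


doc = parseString(SAMPLE)
texts = extract_texts(doc)
for t in texts:
    print(t)
```

Returning a list instead of appending to a global keeps the function easy to test. Whether this alone is fast enough for 6000+ elements is not something the sketch demonstrates, so the lxml approaches above remain worth trying.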