2017-03-16 44 views
2

How do I parse a .txt file into an .xml file?

This is my txt file:

In File Name: C:\Users\naqushab\desktop\files\File 1.m1 
Out File Name: C:\Users\naqushab\desktop\files\Output\File 1.m2 
In File Size: Low: 22636 High: 0 
Total Process time: 1.859000 
Out File Size: Low: 77619 High: 0 

In File Name: C:\Users\naqushab\desktop\files\File 2.m1 
Out File Name: C:\Users\naqushab\desktop\files\Output\File 2.m2 
In File Size: Low: 20673 High: 0 
Total Process time: 3.094000 
Out File Size: Low: 94485 High: 0 

In File Name: C:\Users\naqushab\desktop\files\File 3.m1 
Out File Name: C:\Users\naqushab\desktop\files\Output\File 3.m2 
In File Size: Low: 66859 High: 0 
Total Process time: 3.516000 
Out File Size: Low: 217268 High: 0 

I am trying to parse it into XML like this:

<?xml version='1.0' encoding='utf-8'?> 
<root> 
    <filedata> 
     <InFileName>File 1.m1</InFileName> 
     <OutFileName>File 1.m2</OutFileName> 
     <InFileSize>22636</InFileSize> 
     <OutFileSize>77619</OutFileSize> 
     <ProcessTime>1.859000</ProcessTime> 
    </filedata> 
    <filedata> 
     <InFileName>File 2.m1</InFileName> 
     <OutFileName>File 2.m2</OutFileName> 
     <InFileSize>20673</InFileSize> 
     <OutFileSize>94485</OutFileSize> 
     <ProcessTime>3.094000</ProcessTime> 
    </filedata> 
    <filedata> 
     <InFileName>File 3.m1</InFileName> 
     <OutFileName>File 3.m2</OutFileName> 
     <InFileSize>66859</InFileSize> 
     <OutFileSize>217268</OutFileSize> 
     <ProcessTime>3.516000</ProcessTime> 
    </filedata> 
</root> 

Here is the code I tried (I am using Python 2):

import re 
import xml.etree.ElementTree as ET 

rex = re.compile(r'''(?P<title>In File Name: 
         |Out File Name: 
         |In File Size: Low: 
         |Total Process time: 
         |Out File Size: Low: 
        ) 
        (?P<value>.*) 
        ''', re.VERBOSE) 

root = ET.Element('root') 
root.text = '\n' # newline before the celldata element 

with open('Performance.txt') as f: 
    celldata = ET.SubElement(root, 'filedata') 
    celldata.text = '\n' # newline before the collected element 
    celldata.tail = '\n\n' # empty line after the celldata element 
    for line in f:
        # Empty line starts new celldata element (hack style, ugly)
        if line.isspace():
            celldata = ET.SubElement(root, 'filedata')
            celldata.text = '\n'
            celldata.tail = '\n\n'

        # If the line contains the wanted data, process it.
        m = rex.search(line)
        if m:
            # Fix some problems with the title as it will be used
            # as the tag name.
            title = m.group('title')
            title = title.replace('&', '')
            title = title.replace(' ', '')

            e = ET.SubElement(celldata, title.lower())
            e.text = m.group('value')
            e.tail = '\n'

# Display for debugging 
ET.dump(root) 

# Include the root element to the tree and write the tree 
# to the file. 
tree = ET.ElementTree(root) 
tree.write('Performance.xml', encoding='utf-8', xml_declaration=True) 

But I get empty values. Is it possible to parse this txt into XML?

+0

Where are you getting empty values? Could you please be clearer?

+0

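When a complete program *does not give the expected result*, just split it into smaller parts and try them separately. Here, you should first simply parse the input and print the parts you can find. Only then try to build an XML file.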

+0

Also, your regex and the sub-element names don't match! Is that intentional?

Answers

1

A correction to your regex: it should be

m = re.search('(?P<title>(In File Name)|(Out File Name)|(In File Size: *Low)|(Total Process time)|(Out File Size: *Low)):(?P<value>.*)',line) 

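instead of what you gave. Note that | has the lowest precedence, so In File Name:|Out File Name: already means either In File Name: or Out File Name:; the explicit groups above only make that visible. What breaks your pattern is re.VERBOSE, which ignores the unescaped spaces inside it. The version above is not compiled with that flag, so its spaces are matched literally.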
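A quick way to check the corrected pattern is to run it against one line of the sample input. This is only a minimal sketch: the names `pattern` and `sample` are illustrative and not from the original post.

```python
import re

# Corrected pattern from above: each alternative is grouped, and the
# colon that precedes the value is matched outside the title group.
# No re.VERBOSE here, so the spaces in the pattern are literal.
pattern = re.compile(
    r'(?P<title>(In File Name)|(Out File Name)|(In File Size: *Low)'
    r'|(Total Process time)|(Out File Size: *Low)):(?P<value>.*)'
)

sample = 'In File Size: Low: 22636 High: 0'
m = pattern.search(sample)
title = m.group('title')
value = m.group('value')
print(repr(title), repr(value))
```

Note that for the size lines the captured value still contains the `High:` part, so it needs further splitting before it goes into the XML.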

A suggestion:

You can do this without using a regex. xml.dom.minidom can be used to prettify your XML string.

For better understanding, I have added inline comments!

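Node.toprettyxml([indent=""[, newl=""[, encoding=""]]])

Return a pretty-printed version of the document. indent specifies the indentation string and defaults to a tabulator; newl specifies the string emitted at the end of each line and defaults to \n.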
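To see what toprettyxml does on its own, here is a small sketch that builds a one-record tree and pretty-prints it. It is written for Python 3 (an assumption, since the question uses Python 2), and the element contents are just taken from the sample data.

```python
import xml.etree.ElementTree as ET
import xml.dom.minidom as minidom

# Build a tiny tree shaped like the desired output.
root = ET.Element('root')
filedata = ET.SubElement(root, 'filedata')
ET.SubElement(filedata, 'InFileName').text = 'File 1.m1'

# ET.tostring returns a serialized document that minidom can re-parse;
# toprettyxml then re-emits it with the requested indentation.
raw = ET.tostring(root)
pretty = minidom.parseString(raw).toprettyxml(indent='    ', newl='\n')
print(pretty)
```

If you also pass `encoding`, Python 3 returns bytes rather than a string, so the output file should then be opened in binary mode.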

Edit

import itertools as it 
[line[0] for line in it.groupby(lines)] 

You can use groupby from the itertools package on the list of lines to collapse consecutive duplicates into one.
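As a small illustration of what that does to runs of blank lines, here is a sketch with a made-up `lines` list (the contents are hypothetical, not the real file):

```python
import itertools as it

lines = ['', '', 'In File Name: a', '', '', '', 'In File Name: b', '']

# groupby yields one (key, group) pair per run of equal consecutive items;
# keeping only the key collapses each run to a single entry.
deduped = [key for key, _group in it.groupby(lines)]
print(deduped)
```

Each run of blank lines shrinks to a single blank line, but a blank entry at the very start or end is still there. Identical data lines that happen to be adjacent would be collapsed as well.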

So,

import xml.etree.ElementTree as ET 
root = ET.Element('root') 

with open('file1.txt') as f: 
    lines = f.read().splitlines() 

#add first subelement 
celldata = ET.SubElement(root, 'filedata') 

import itertools as it 
#for every line in input file 
#group consecutive dedup to one 
for line in it.groupby(lines):
    line = line[0]
    #if its a break of subelements - that is an empty space
    if not line:
        #add the next subelement and get it as celldata
        celldata = ET.SubElement(root, 'filedata')
    else:
        #otherwise, split with : to get the tag name
        tag = line.split(":")
        #format tag name
        el = ET.SubElement(celldata, tag[0].replace(" ", ""))
        tag = ' '.join(tag[1:]).strip()

        #get file name from file path
        if 'File Name' in line:
            tag = line.split("\\")[-1].strip()
        elif 'File Size' in line:
            splist = filter(None, line.split(" "))
            tag = splist[splist.index('Low:')+1]
            #splist[splist.index('High:')+1]
        el.text = tag

#prettify xml 
import xml.dom.minidom as minidom 
formatedXML = minidom.parseString(
          ET.tostring(
             root)).toprettyxml(indent=" ",encoding='utf-8').strip() 
# Display for debugging 
print formatedXML 

#write the formatedXML to file. 
with open("Performance.xml","w+") as f: 
    f.write(formatedXML) 

Output: Performance.xml

<?xml version="1.0" encoding="utf-8"?> 
<root> 
<filedata> 
    <InFileName>File 1.m1</InFileName> 
    <OutFileName>File 1.m2</OutFileName> 
    <InFileSize>22636</InFileSize> 
    <TotalProcesstime>1.859000</TotalProcesstime> 
    <OutFileSize>77619</OutFileSize> 
</filedata> 
<filedata> 
    <InFileName>File 2.m1</InFileName> 
    <OutFileName>File 2.m2</OutFileName> 
    <InFileSize>20673</InFileSize> 
    <TotalProcesstime>3.094000</TotalProcesstime> 
    <OutFileSize>94485</OutFileSize> 
</filedata> 
<filedata> 
    <InFileName>File 3.m1</InFileName> 
    <OutFileName>File 3.m2</OutFileName> 
    <InFileSize>66859</InFileSize> 
    <TotalProcesstime>3.516000</TotalProcesstime> 
    <OutFileSize>217268</OutFileSize> 
</filedata> 
</root> 

Hope it helps!

+0

Perfect! Just one thing: how can I handle multiple blank lines, since the generated txt may have some empty lines at the beginning and end? – naqushab

+0

itertools groupby should do the trick! I have added it as an edit.
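One caveat: groupby only collapses runs of blank lines, so a blank line at the very start or end of the file would still produce an empty filedata element. A possible alternative is to create each filedata lazily, on the first non-blank line of a record. The sketch below assumes the same input layout as in the question; `build_tree` and `sample` are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET

def build_tree(lines):
    """Build <root> with one <filedata> per non-empty block of lines."""
    root = ET.Element('root')
    celldata = None  # created lazily, so blank lines never add empty elements
    for line in lines:
        line = line.strip()
        if not line:
            celldata = None  # a blank line ends the current record
            continue
        if celldata is None:
            celldata = ET.SubElement(root, 'filedata')
        # Split at the first colon: tag name on the left, data on the right.
        name, _, rest = line.partition(':')
        el = ET.SubElement(celldata, name.replace(' ', ''))
        if 'File Name' in name:
            el.text = rest.split('\\')[-1].strip()  # keep the file name only
        elif 'File Size' in name:
            parts = rest.split()
            el.text = parts[parts.index('Low:') + 1]  # value after "Low:"
        else:
            el.text = rest.strip()
    return root

sample = [
    '', '  ',
    r'In File Name: C:\Users\naqushab\desktop\files\File 1.m1 ',
    'In File Size: Low: 22636 High: 0 ',
    'Total Process time: 1.859000 ',
    '', '',
    r'In File Name: C:\Users\naqushab\desktop\files\File 2.m1 ',
    '',
]
root = build_tree(sample)
records = root.findall('filedata')
print(len(records))
```

Because whitespace-only lines are stripped before the check, leading, trailing, and repeated blank lines are all treated the same way, and groupby is no longer needed.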

0

From the docs (emphasis mine):

re.VERBOSE
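This flag allows you to write regular expressions that look nicer. *Whitespace within the pattern is ignored*, except when in a character class or preceded by an unescaped backslash, and, when a line contains a '#' neither in a character class or preceded by an unescaped backslash, all characters from the leftmost such '#' through the end of the line are ignored.

Escape the spaces in the regular expression or use \s.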
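For example, the pattern from the question could be kept in verbose form with the spaces escaped. This is a minimal sketch rather than a drop-in fix: the `\s+` and `\s*` parts and the name `broken` are my own additions for illustration.

```python
import re

# Same alternatives as in the question, but every literal space is
# escaped (or written as \s) so that re.VERBOSE does not discard it.
rex = re.compile(r'''(?P<title>In\ File\ Name:
                     |Out\ File\ Name:
                     |In\ File\ Size:\s+Low:
                     |Total\ Process\ time:
                     |Out\ File\ Size:\s+Low:
                     )
                     \s*(?P<value>.*)
                  ''', re.VERBOSE)

m = rex.search('Total Process time: 1.859000 ')
title = m.group('title')
value = m.group('value').strip()
print(title, value)

# With unescaped spaces the verbose pattern silently becomes "InFileName:".
broken = re.compile(r'(?P<title>In File Name:)', re.VERBOSE)
print(broken.search('In File Name: x'))
```

The last line prints None, which is why the original code never found a match and produced empty elements.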