2016-04-05 15 views
1

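Removing carriage returns embedded in XML text using Python on Unix

I occasionally get errors when processing XML files, because the text inside some XML strings has carriage returns forced into it.

How can I remove these carriage returns inside the XML text on Unix without removing all of them? Removing every one would join all the XML records in the file together.

Example of an XML script I can parse:

Example I cannot parse because of the carriage returns: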

<?xml version="1.0"?><script startAt="2015-03-25T20:59:38Z" sessionId="xyz"> 
<message attribute= 'hello world, i am going to add a cariage return 
right now 
even though 
i do not have to'></message></script> 

After processing, my final output should look like:

<?xml version="1.0"?><script startAt="2015-03-25T20:59:38Z" sessionId="xyz"><message attribute = 'hello world, i am not going to add a cariage return right now'></message></script> 
<?xml version="1.0"?><script startAt="2015-03-25T20:59:38Z" sessionId="xyz"><message attribute= 'hello world, i am going to add a cariage return right now even though i do not have to'></message></script> 

What I do not want is to remove all the carriage returns, because then my final output would look like:

<?xml version="1.0"?><script startAt="2015-03-25T20:59:38Z" sessionId="xyz"><message attribute= 'hello world, i am not going to add a cariage return right now'></message></script><?xml version="1.0"?><script startAt="2015-03-25T20:59:38Z" sessionId="xyz"><message attribute = 'hello world, i am going to add a cariage return right now even though i do not have to'></message></script> 
+0

xml.etree, lxml –

+0

Remove the newlines with `tr -d '\n'` –

Answers

0

First, the example is not valid XML. It could be like this:

<?xml version="1.0"?><script startAt="2015-03-25T20:59:38Z" sessionId="xyz"> 
<message attribute = 'hello world, i am going to add a cariage return 
right now 
even though 
i do not have to'/></script> 

or this:

<?xml version="1.0"?><script startAt="2015-03-25T20:59:38Z" sessionId="xyz"> 
<message>hello world, i am going to add a cariage return 
right now 
even though 
i do not have to</message></script> 

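Also, I assume you want to remove \n, not carriage returns.

Try this function:

import re
from lxml import etree

def removeEndl(xml):
    # Parse the XML string into an element tree.
    root = etree.XML(xml)

    # Visit every element and strip CRLF or LF from its text and attributes.
    for element in root.xpath('//*'):
        if element.text is not None:
            element.text = re.sub(r'\r?\n', '', element.text)
        for key, value in element.attrib.items():
            element.attrib[key] = re.sub(r'\r?\n', '', value)

    # etree.tostring returns the serialized document as bytes.
    return etree.tostring(root)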
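
If the file holds several records back to back, as the expected output in the question suggests, a plain-text pass may be enough before any parsing. The following is only a minimal sketch under two assumptions: every record starts with an XML declaration (`<?xml`), and a line break inside a record can safely become a single space. The function name `one_record_per_line` is made up for this illustration.

```python
import re

def one_record_per_line(text):
    """Collapse line breaks inside each XML record; emit one record per line."""
    # Assumption: each record begins with an XML declaration ("<?xml").
    records = re.findall(r'<\?xml.*?(?=<\?xml|\Z)', text, flags=re.DOTALL)
    cleaned = []
    for record in records:
        record = record.strip()
        # Drop line breaks (and surrounding whitespace) sitting between two tags.
        record = re.sub(r'>\s*\r?\n\s*<', '><', record)
        # Turn any remaining line break into a single space.
        record = re.sub(r'\s*\r?\n\s*', ' ', record)
        cleaned.append(record)
    return '\n'.join(cleaned)
```

Because this works on raw text rather than a parsed tree, it would also alter line breaks inside comments or CDATA sections, so it is best treated as a quick clean-up step rather than a general XML tool.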
+0

I want to remove \n, but when I view the XML data in Notepad++, the \n is not shown as \n, it is shown as CRLF – shecode

+0

When you open a file with Python, CRLF is converted to LF. Anyway, I edited the code to use a regular expression that removes CRLF or LF. – apr

+0

"When you open a file with Python, CRLF is converted to LF." False. That only happens if you specify universal newlines mode, as mentioned in @sebastian's answer. – pydsigner
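
The disagreement in these comments can be checked directly. Below is a small sketch, assuming Python 3, that writes CRLF-terminated bytes to a temporary file and reads them back three ways: in binary mode, in text mode with `newline=None` (the Python 3 default, which translates line endings), and in text mode with `newline=''` (which leaves them untouched). The helper name `read_three_ways` is hypothetical.

```python
import os
import tempfile

def read_three_ways(raw=b'line one\r\nline two\r\n'):
    """Write raw bytes to a temporary file and read them back in three modes."""
    fd, path = tempfile.mkstemp(suffix='.xml')
    try:
        with os.fdopen(fd, 'wb') as handle:
            handle.write(raw)
        # Binary mode: bytes come back exactly as stored.
        with open(path, 'rb') as handle:
            as_bytes = handle.read()
        # Text mode, newline=None: \r\n and \r are translated to \n.
        with open(path, 'r', newline=None) as handle:
            translated = handle.read()
        # Text mode, newline='': line endings are returned untranslated.
        with open(path, 'r', newline='') as handle:
            untranslated = handle.read()
    finally:
        os.remove(path)
    return as_bytes, translated, untranslated
```

Whether CRLF survives therefore depends on the mode and the `newline` argument used when opening the file, and on the Python version in play.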

0

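When opening the XML file, you will most likely also want to use universal newlines, which Python supports. This makes Python replace any \r\n or \r with \n.

To use it in Python 2, just add a U to the file open mode. In Python 3, text mode applies universal newlines by default (newline=None), and the U flag was removed in Python 3.11:

import xml.etree.ElementTree as ET
with open('my.xml', 'r', newline=None) as myxml:
    tree = ET.parse(myxml)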