2010-06-09 60 views
30

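Compare XML snippets?

Building on another SO question: how can I check whether two well-formed XML snippets are semantically equal? All I need is "equal or not", because I'm using this for unit tests.

In the system I want, these would be equal (note the order of "start" and "end"):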

<?xml version='1.0' encoding='utf-8' standalone='yes'?> 
<Stats start="1275955200" end="1276041599"> 
</Stats> 

# Reordered start and end 

<?xml version='1.0' encoding='utf-8' standalone='yes'?> 
<Stats end="1276041599" start="1275955200" > 
</Stats> 

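I have lxml and other tools at my disposal, and a simple function that only allows attributes to be reordered would work fine too!


Working snippet based on IanB's answer: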

from formencode.doctest_xml_compare import xml_compare 
# have to strip these or fromstring carps 
xml1 = """ <?xml version='1.0' encoding='utf-8' standalone='yes'?> 
    <Stats start="1275955200" end="1276041599"></Stats>""" 
xml2 = """  <?xml version='1.0' encoding='utf-8' standalone='yes'?> 
    <Stats end="1276041599" start="1275955200"></Stats>""" 
xml3 = """ <?xml version='1.0' encoding='utf-8' standalone='yes'?> 
    <Stats start="1275955200"></Stats>""" 

from lxml import etree 
tree1 = etree.fromstring(xml1.strip()) 
tree2 = etree.fromstring(xml2.strip()) 
tree3 = etree.fromstring(xml3.strip()) 

import sys 
reporter = lambda x: sys.stdout.write(x + "\n") 

assert xml_compare(tree1,tree2,reporter) 
assert xml_compare(tree1,tree3,reporter) is False 
+1

`from formencode.doctest_xml_compare import xml_compare` – laike9m 2015-01-04 08:59:35

Answers

24

You can use formencode.doctest_xml_compare: its xml_compare function compares two ElementTree or lxml trees.

+0

Thanks Ian, I'm glad you've already got this one solved! – 2010-06-09 17:37:56

+2

This function is incorrect: if you swap the attribute order in the XML, it returns False. – mnowotka 2014-05-29 13:59:08

+0

@mnowotka That's not right: it treats _attributes_ in a different order as equal. – Anentropic 2015-02-06 12:44:15

2

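If you take a DOM approach, you can traverse both trees simultaneously, comparing nodes (node type, text, attributes) as you go.

A recursive solution would be the most elegant: short-circuit further comparison as soon as a pair of nodes is not "equal", or as soon as you detect a leaf in one tree where the other has a branch, and so on.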
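The recursive traversal described above could look roughly like the sketch below. It is only an illustration, assuming the standard library's xml.etree.ElementTree; the function name elements_equal is made up for this example, and treating whitespace-only text as empty is an assumption that suits the unit-test use case in the question.

```python
import xml.etree.ElementTree as ET


def elements_equal(a, b):
    """Recursively compare two elements; child order is significant."""
    if a.tag != b.tag:
        return False
    # attrib is a plain dict, so attribute order does not matter here
    if a.attrib != b.attrib:
        return False
    # treat missing text and whitespace-only text as equivalent
    if (a.text or "").strip() != (b.text or "").strip():
        return False
    if (a.tail or "").strip() != (b.tail or "").strip():
        return False
    # a leaf in one tree versus a branch in the other
    if len(a) != len(b):
        return False
    # all() stops at the first unequal pair of children
    return all(elements_equal(x, y) for x, y in zip(a, b))


one = ET.fromstring('<Stats start="1275955200" end="1276041599"></Stats>')
two = ET.fromstring('<Stats end="1276041599" start="1275955200" />')
three = ET.fromstring('<Stats start="1275955200"></Stats>')

print(elements_equal(one, two))    # True
print(elements_equal(one, three))  # False
```

Because the children are paired up with zip(), two documents whose child elements appear in a different order compare as unequal.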

+1

That's the solution; I was just hoping someone had already written one. – 2010-06-11 14:25:01

5

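I had the same problem: I wanted to compare two documents that had the same attributes, but in a different order.

XML Canonicalization (C14N) in lxml seems to work well for this, but I'm definitely not an XML expert. I'd be interested to know whether anyone else can point out drawbacks of this approach.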

from io import BytesIO

from lxml import etree

# xml_string1 and xml_string2 hold the two documents being compared
parser = etree.XMLParser(remove_blank_text=True)

xml1 = etree.fromstring(xml_string1, parser)
xml2 = etree.fromstring(xml_string2, parser)

print("xml1 == xml2: " + str(xml1 == xml2))

ppxml1 = etree.tostring(xml1, pretty_print=True)
ppxml2 = etree.tostring(xml2, pretty_print=True)

print("pretty(xml1) == pretty(xml2): " + str(ppxml1 == ppxml2))

# write_c14n emits bytes, so collect the output in a BytesIO buffer
xml_string_io1 = BytesIO()
xml1.getroottree().write_c14n(xml_string_io1)
cxml1 = xml_string_io1.getvalue()

xml_string_io2 = BytesIO()
xml2.getroottree().write_c14n(xml_string_io2)
cxml2 = xml_string_io2.getvalue()

print("canonicalize(xml1) == canonicalize(xml2): " + str(cxml1 == cxml2))

Running this gives me:

$ python test.py 
xml1 == xml2: False
pretty(xml1) == pretty(xml2): False
canonicalize(xml1) == canonicalize(xml2): True
+0

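I also had this approach in mind and am looking for the drawbacks, or whether this might really be the canonical way of comparing XML files... (pun intended) – michuelnik 2014-01-29 22:02:57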

+0

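I've been using this on a live website to compare XML documents for version-control purposes. It works well, but c14n doesn't account for identical child elements in a different order, so I still sometimes get spurious results. – 2014-01-30 00:55:33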

+0

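Does c14n reorder children? I'd guess not... You mean that when the same children are present but in a different order, you'd like a "no difference" result, but this yields "difference detected"? In my opinion, the order of children might matter. ;) – michuelnik 2014-01-30 13:41:40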
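The point raised in these comments can be checked with a small experiment. The sketch below swaps in the standard library's xml.etree.ElementTree.canonicalize (C14N 2.0, available from Python 3.8) instead of lxml's write_c14n, purely so that it runs without third-party packages; the sample documents and variable names are made up for illustration.

```python
from xml.etree.ElementTree import canonicalize  # Python 3.8+

attrs_a = '<Stats start="1" end="2"><a/><b/></Stats>'
attrs_b = '<Stats end="2" start="1"><a/><b/></Stats>'
kids_swapped = '<Stats start="1" end="2"><b/><a/></Stats>'

# canonicalization sorts attributes but leaves child order untouched
same_attrs = canonicalize(attrs_a) == canonicalize(attrs_b)
same_kids = canonicalize(attrs_a) == canonicalize(kids_swapped)

print(same_attrs)  # True
print(same_kids)   # False
```

So canonicalization normalises attribute order only; if sibling order should be ignored as well, the children have to be sorted by some other means before comparing.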

1

Thinking about this problem, I came up with the following solution, which makes XML elements comparable and sortable:

import xml.etree.ElementTree as ET 
def cmpElement(x, y): 
    # compare type 
    r = cmp(type(x), type(y)) 
    if r: return r 
    # compare tag 
    r = cmp(x.tag, y.tag) 
    if r: return r 
    # compare tag attributes 
    r = cmp(x.attrib, y.attrib) 
    if r: return r 
    # compare stripped text content 
    xtext = (x.text and x.text.strip()) or None 
    ytext = (y.text and y.text.strip()) or None 
    r = cmp(xtext, ytext) 
    if r: return r 
    # compare sorted children 
    if len(x) or len(y): 
        return cmp(sorted(x.getchildren()), sorted(y.getchildren()))
    return 0 

ET._ElementInterface.__lt__ = lambda self, other: cmpElement(self, other) == -1 
ET._ElementInterface.__gt__ = lambda self, other: cmpElement(self, other) == 1 
ET._ElementInterface.__le__ = lambda self, other: cmpElement(self, other) <= 0 
ET._ElementInterface.__ge__ = lambda self, other: cmpElement(self, other) >= 0 
ET._ElementInterface.__eq__ = lambda self, other: cmpElement(self, other) == 0 
ET._ElementInterface.__ne__ = lambda self, other: cmpElement(self, other) != 0 
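The code above relies on the cmp() builtin, which Python 3 removed. A rough Python 3 sketch of the same idea (compare tag, attributes, stripped text, then sorted children) is shown below. It does not patch the Element class; instead it builds a sortable key per element. The names element_key and same_xml are invented for this example.

```python
import xml.etree.ElementTree as ET


def element_key(elem):
    """Build a nested tuple that is equal for equivalent elements."""
    text = (elem.text or "").strip()
    attrs = tuple(sorted(elem.attrib.items()))
    # sorting the child keys makes the comparison ignore sibling order
    children = tuple(sorted(element_key(child) for child in elem))
    return (elem.tag, attrs, text, children)


def same_xml(x, y):
    return element_key(x) == element_key(y)


a = ET.fromstring('<r><i n="1">x</i><i n="2">y</i></r>')
b = ET.fromstring('<r><i n="2">y</i><i n="1">x</i></r>')
c = ET.fromstring('<r><i n="1">x</i></r>')

print(same_xml(a, b))  # True
print(same_xml(a, c))  # False
```

Since the keys are plain tuples of strings, they can also be passed to sorted() as a key function when elements need to be ordered.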
14

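The order of elements can be significant in XML, which may be why most of the other suggested approaches compare as unequal if the order differs... even when the elements have the same attributes and text content.

But I also wanted an order-insensitive comparison, so I came up with this: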

from lxml import etree 
import xmltodict # pip install xmltodict 


def normalise_dict(d): 
    """ 
    Recursively convert dict-like object (eg OrderedDict) into plain dict. 
    Sorts list values. 
    """ 
    out = {} 
    for k, v in dict(d).iteritems():
        if hasattr(v, 'iteritems'):
            out[k] = normalise_dict(v)
        elif isinstance(v, list):
            out[k] = []
            for item in sorted(v):
                if hasattr(item, 'iteritems'):
                    out[k].append(normalise_dict(item))
                else:
                    out[k].append(item)
        else:
            out[k] = v
    return out 


def xml_compare(a, b): 
    """ 
    Compares two XML documents (as string or etree) 

    Does not care about element order 
    """ 
    if not isinstance(a, basestring):
        a = etree.tostring(a)
    if not isinstance(b, basestring):
        b = etree.tostring(b)
    a = normalise_dict(xmltodict.parse(a)) 
    b = normalise_dict(xmltodict.parse(b)) 
    return a == b 
+1

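This is definitely the best answer and should be accepted. It's the only answer that actually cares about the fact that field order in XML doesn't matter. – mnowotka 2014-05-29 13:58:02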

+3

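Two things to consider: the order of _attributes_ really doesn't matter. But the order of elements is significant in XML; this code is for the special case where you don't care about element order. – Anentropic 2014-05-29 14:15:49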

0

Adapting Anentropic's great answer to Python 3 (basically, change iteritems() to items(), and basestring to str):

import json

from lxml import etree
import xmltodict # pip install xmltodict


def normalise_dict(d):
    """
    Recursively convert dict-like object (eg OrderedDict) into plain dict.
    Sorts list values.
    """
    out = {}
    for k, v in dict(d).items():
        if hasattr(v, 'items'):
            out[k] = normalise_dict(v)
        elif isinstance(v, list):
            # Python 3 cannot order dicts directly, so normalise each item
            # first and sort on its JSON form with sorted keys
            items = [normalise_dict(i) if hasattr(i, 'items') else i for i in v]
            out[k] = sorted(items, key=lambda i: json.dumps(i, sort_keys=True))
        else:
            out[k] = v
    return out


def xml_compare(a, b):
    """
    Compares two XML documents (as string or etree)

    Does not care about element order
    """
    if not isinstance(a, (str, bytes)):
        a = etree.tostring(a)
    if not isinstance(b, (str, bytes)):
        b = etree.tostring(b)
    a = normalise_dict(xmltodict.parse(a))
    b = normalise_dict(xmltodict.parse(b))
    return a == b
+1

You can use the `dict_constructor=dict` option of xmltodict: `xmltodict.parse(a, dict_constructor=dict)`, so you don't need the `normalise_dict` function. – inoks 2016-06-11 19:16:20

0

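Since the order of attributes is not significant in XML, you want to ignore differences caused by attribute ordering. XML canonicalization (C14N) orders attributes deterministically, so you can use it to test for equality: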

xml1 = b''' <?xml version='1.0' encoding='utf-8' standalone='yes'?> 
    <Stats start="1275955200" end="1276041599"></Stats>''' 
xml2 = b'''  <?xml version='1.0' encoding='utf-8' standalone='yes'?> 
    <Stats end="1276041599" start="1275955200"></Stats>''' 
xml3 = b''' <?xml version='1.0' encoding='utf-8' standalone='yes'?> 
    <Stats start="1275955200"></Stats>''' 

import lxml.etree 

tree1 = lxml.etree.fromstring(xml1.strip()) 
tree2 = lxml.etree.fromstring(xml2.strip()) 
tree3 = lxml.etree.fromstring(xml3.strip()) 

import io 

b1 = io.BytesIO() 
b2 = io.BytesIO() 
b3 = io.BytesIO() 

tree1.getroottree().write_c14n(b1) 
tree2.getroottree().write_c14n(b2) 
tree3.getroottree().write_c14n(b3) 

assert b1.getvalue() == b2.getvalue() 
assert b1.getvalue() != b3.getvalue() 

Note that this example assumes Python 3. With Python 3, using b'''...''' strings and io.BytesIO is mandatory, whereas with Python 2 this method also works with normal strings and io.StringIO.

5

Here is a simple solution: convert the XML to dictionaries (with xmltodict) and compare the dictionaries:

import json 
import xmltodict 

class XmlDiff(object): 
    def __init__(self, xml1, xml2):
        self.dict1 = json.loads(json.dumps((xmltodict.parse(xml1))))
        self.dict2 = json.loads(json.dumps((xmltodict.parse(xml2))))

    def equal(self):
        return self.dict1 == self.dict2

Unit tests

import unittest 

class XMLDiffTestCase(unittest.TestCase): 

    def test_xml_equal(self):
        xml1 = """<?xml version='1.0' encoding='utf-8' standalone='yes'?>
        <Stats start="1275955200" end="1276041599">
        </Stats>"""
        xml2 = """<?xml version='1.0' encoding='utf-8' standalone='yes'?>
        <Stats end="1276041599" start="1275955200" >
        </Stats>"""
        self.assertTrue(XmlDiff(xml1, xml2).equal())

    def test_xml_not_equal(self):
        xml1 = """<?xml version='1.0' encoding='utf-8' standalone='yes'?>
        <Stats start="1275955200">
        </Stats>"""
        xml2 = """<?xml version='1.0' encoding='utf-8' standalone='yes'?>
        <Stats end="1276041599" start="1275955200" >
        </Stats>"""
        self.assertFalse(XmlDiff(xml1, xml2).equal())

Or as a simple Python function:

import json 
import xmltodict 

def xml_equal(a, b):
    """
    Compares two XML documents given as strings

    Attribute order is ignored; repeated sibling elements are compared in order
    """
    return json.loads(json.dumps(xmltodict.parse(a))) == json.loads(json.dumps(xmltodict.parse(b)))
0

What about the following code snippet? It can easily be enhanced to include attribs as well:

import xml.etree.ElementTree as ET

# These are written as methods (note the self argument), so they belong inside a class.

def separator(self):
    return "!@#$%^&*" # Very ugly separator

def _traverseXML(self, xmlElem, tags, xpaths):
    tags.append(xmlElem.tag)
    for e in xmlElem:
        self._traverseXML(e, tags, xpaths)

    text = ''
    if (xmlElem.text):
        text = xmlElem.text.strip()

    xpaths.add("/".join(tags) + self.separator() + text)
    tags.pop()

def _xmlToSet(self, xml):
    xpaths = set() # output
    tags = list()
    root = ET.fromstring(xml)
    self._traverseXML(root, tags, xpaths)

    return xpaths

def _areXMLsAlike(self, xml1, xml2):
    xpaths1 = self._xmlToSet(xml1)
    xpaths2 = self._xmlToSet(xml2)

    return xpaths1 == xpaths2
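One possible way to extend the snippet above to attributes is sketched here. It is only an illustration, not the answer author's code: the function name path_set is invented, and each element is stored as a (path, sorted attributes, text) tuple instead of a joined string, which avoids the need for a separator.

```python
import xml.etree.ElementTree as ET


def path_set(xml_text):
    """Collect one (path, sorted attributes, text) tuple per element."""
    found = set()

    def walk(elem, parents):
        path = parents + [elem.tag]
        # sorting makes the attribute order irrelevant
        attrs = tuple(sorted(elem.attrib.items()))
        text = (elem.text or "").strip()
        found.add(("/".join(path), attrs, text))
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), [])
    return found


x1 = '<Stats start="1275955200" end="1276041599"></Stats>'
x2 = '<Stats end="1276041599" start="1275955200"></Stats>'
x3 = '<Stats start="1275955200"></Stats>'

print(path_set(x1) == path_set(x2))  # True
print(path_set(x1) == path_set(x3))  # False
```

As with the original snippet, a set collapses duplicates, so a document with two identical sibling elements compares equal to one that has only a single copy.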