How to check lxml element tree strings?

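I have a list of lxml element trees. I want to store in a dictionary the number of times each subtree appears among the subtrees of all trees in the list. For example: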

# Python 2 (print statements), matching the output shown below.
from collections import defaultdict
from lxml import etree as ET

tree1 = '''<A attribute1="1"><B><C/></B></A>'''
tree2 = '''<A attribute1="1"><D><C attribute="2"/></D></A>'''
tree3 = '''<E attribute1="1"><B><C/></B></E>'''
list_trees = [tree1, tree2, tree3]
print list_trees

# Count how often each serialized subtree occurs across all trees.
mydict = defaultdict(int)
for tree in list_trees:
    root = ET.fromstring(tree)
    for sub_root in root.iter():
        print ET.tostring(sub_root)
        mydict[ET.tostring(sub_root)] += 1
print mydict

I get the following, correct result:

defaultdict(<type 'int'>, {'<E attribute1="1"><B><C/></B></E>': 1, '<C/>': 2, '<A attribute1="1"><D><C attribute="2"/></D></A>': 1, '<B><C/></B>': 2, '<C attribute="2"/>': 1, '<D><C attribute="2"/></D>': 1, '<A attribute1="1"><B><C/></B></A>': 1})

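This works for this particular example. In general, however, two XML documents can be equivalent while differing in attribute order or in insignificant extra whitespace or newlines, and that general case breaks my approach. I know there are posts on how to check whether two XML trees are identical, but I want to convert the XML to a string for this specific application (unique trees are easy to keep as strings, which makes comparison simple and leaves more flexibility for the future) and so that it can be stored cleanly in SQL. How can XML be turned into a string in a consistent way, regardless of attribute ordering, extra spaces, or extra lines?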

Edit, giving a case that does not work: these 3 XML trees are the same; they differ only in attribute order or in extra spaces or newlines.

tree4='''<A attribute1="1" attribute2="2"><B><C/></B></A>''' 
tree5='''<A attribute1="1"  attribute2="2" > 
<B><C/></B></A>''' 
tree6='''<A attribute2="2" attribute1="1"><B><C/></B></A>'''

My output gives the following:

defaultdict(<type 'int'>, {'<B><C/></B>': 3, '<A attribute1="1" attribute2="2"><B><C/></B></A>': 1, '<A attribute1="1" attribute2="2">\n<B><C/></B></A>': 1, '<C/>': 3, '<A attribute2="2" attribute1="1"><B><C/></B></A>': 1})

However, the output should be:

defaultdict(<type 'int'>, {'<B><C/></B>': 3, '<A attribute1="1" attribute2="2"><B><C/></B></A>': 3, '<C/>': 3})

Source

2017-04-26 user2015487

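Why not provide sample XML and the expected output for a case where it *isn't* working, instead of saying "here is some XML that works, but there is some other XML that doesn't"? – miken32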

Agreed. Thanks for the comment. Editing now. – user2015487

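If you insist on comparing string representations of XML trees, I suggest using BeautifulSoup on top of lxml. In particular, calling prettify() on any part of the tree produces a consistent representation that ignores whitespace and odd formatting in the input. The output strings are a bit verbose, but they work. I went ahead and replaced newlines with "fake newlines" ('\n' -> '\\n') so that the output is more compact.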

from collections import defaultdict
from bs4 import BeautifulSoup as Soup

tree4 = '''<A attribute1="1" attribute2="2"><B><C/></B></A>'''
tree5 = '''<A attribute1="1"  attribute2="2" >
<B><C/></B></A>'''
tree6 = '''<A attribute2="2" attribute1="1"><B><C/></B></A>'''
list_trees = [tree4, tree5, tree6]

mydict = defaultdict(int)
for tree in list_trees:
    root = Soup(tree, 'lxml-xml')  # Use the LXML XML parser.
    for sub_root in root.find_all():
        print(sub_root)
        # Key on the prettified form, with newlines escaped for compact output.
        mydict[sub_root.prettify().replace('\n', '\\n')] += 1

print('Results')
for key, value in mydict.items():
    print(u'%s: %s' % (key, value))

This prints the desired result (with some extra newlines and spaces):

$ python counter.py

<A attribute1="1" attribute2="2">\n <B>\n <C/>\n </B>\n</A>: 3 
<B>\n <C/>\n</B>: 3 
<C/>\n: 3

Source

2017-04-26 19:41:40 supersam654

Thanks! I found that mydict[sub_root.prettify().replace('\n', '')] += 1 is what I needed. With this I have not yet found a case that did not work. – user2015487
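
For comparison with the prettify() approach in the answer above, here is a minimal sketch of another way to get a consistent string per subtree. It is an assumption-laden alternative, not the answerer's method: it swaps lxml and BeautifulSoup for the standard library's xml.etree.ElementTree and its canonicalize() function (Python 3.8+), which writes attributes in sorted order, and it passes strip_text=True so that whitespace-only text between elements is dropped. The helper name canonical_subtrees is hypothetical.

```python
import copy
from collections import Counter
import xml.etree.ElementTree as ET  # standard library, not lxml (assumption)


def canonical_subtrees(xml_text):
    """Yield a C14N 2.0 string for every element of the parsed document."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Shallow-copy the element so its tail (text following its end tag)
        # can be dropped without modifying the parsed tree.
        sub = copy.copy(elem)
        sub.tail = None
        raw = ET.tostring(sub, encoding="unicode")
        # canonicalize() sorts attributes; strip_text=True trims whitespace
        # around text content, so whitespace-only text disappears.
        yield ET.canonicalize(raw, strip_text=True)


tree4 = '''<A attribute1="1" attribute2="2"><B><C/></B></A>'''
tree5 = '''<A attribute1="1"  attribute2="2" >
<B><C/></B></A>'''
tree6 = '''<A attribute2="2" attribute1="1"><B><C/></B></A>'''

counts = Counter()
for tree in (tree4, tree5, tree6):
    counts.update(canonical_subtrees(tree))

for key, value in counts.items():
    print("%s: %s" % (key, value))
```

With these three inputs the counter ends up with three keys, each counted 3 times. Canonical XML has no self-closing tags, so the empty element appears as <C></C>. Note that strip_text=True also trims leading and trailing whitespace in real text content, which may or may not be acceptable for a given data set.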
