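You need to recursively group consecutive elements that share a common tag. The implementation below lets you pass in a function that decides how the text is combined: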
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import itertools
import operator
import os.path
from lxml import etree
text = """
<root>
<data>
<slide name="file.xml">
<subtitle>Text1</subtitle>
<MainTitle>Text2</MainTitle>
<MainTitle>text3</MainTitle>
</slide>
<slide name="file.xml">
<Title>String1</Title>
<Title>String2</Title>
<Title>String3</Title>
<Title>String4</Title>
<Title>String5</Title>
<Title>String6</Title>
<Title>String7</Title>
<Title>String8</Title>
</slide>
</data>
</root>
"""
def combine_elements(elements, combine_text=', '.join):
    result = []
    # groupby only groups *consecutive* elements that have the same tag
    for key, group in itertools.groupby(elements, operator.attrgetter('tag')):
        items = list(group)
        first_item = items[0]
        # combine only if the items have no children
        if len(items) > 1 and not len(first_item):
            combined = combine_text([el.text for el in items])
            # and only if combine_text returned something, e.g. the
            # strings have a common prefix
            if combined:
                first_item.text = combined
                result.append(first_item)
                continue
        result.extend(items)
    # replace the children in place with the combined list
    elements[:] = result
    # recursively combine the children of the remaining elements
    for element in elements:
        combine_elements(element, combine_text)

doc = etree.fromstring(text)
combine_elements(doc, os.path.commonprefix)
# encoding='unicode' returns text rather than bytes
print(etree.tostring(doc, encoding='unicode'))
Using os.path.commonprefix() to combine the text, you get the following result:
<root>
<data>
<slide name="file.xml">
<subtitle>Text1</subtitle>
<MainTitle>Text2</MainTitle>
<MainTitle>text3</MainTitle>
</slide>
<slide name="file.xml">
<Title>String</Title>
</slide>
</data>
</root>
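To see why the eight Title elements collapse into one while the two MainTitle elements stay separate, it may help to look at what os.path.commonprefix() returns for each group on its own. This is a minimal sketch using only the standard library; the variable names are illustrative and not part of the code above:

```python
import os.path

# commonprefix compares the strings character by character,
# so it is case-sensitive and knows nothing about XML.
titles = ["String%d" % i for i in range(1, 9)]  # String1 .. String8
main_titles = ["Text2", "text3"]

title_prefix = os.path.commonprefix(titles)
main_title_prefix = os.path.commonprefix(main_titles)

print(repr(title_prefix))       # 'String'
print(repr(main_title_prefix))  # ''
```

Because the second result is an empty string, the `if combined:` check in combine_elements() leaves the MainTitle elements untouched.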
If you would rather join all the text with a slash (/), for example, you can use the following:
doc = etree.fromstring(text)
combine_elements(doc, '/'.join)
Result:
<root>
<data>
<slide name="file.xml">
<subtitle>Text1</subtitle>
<MainTitle>Text2/text3</MainTitle>
</slide>
<slide name="file.xml">
<Title>String1/String2/String3/String4/String5/String6/String7/String8</Title>
</slide>
</data>
</root>
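Since combine_text is just a callable that takes a list of strings and returns one string (or something falsy to skip merging), you could also write your own. The sketch below shows two hypothetical helpers, first_text and prefix_or_join, which are not part of the answer above or of lxml. It assumes every element in a group actually has text (el.text is not None):

```python
import os.path


def first_text(texts):
    """Keep only the text of the first element in the group."""
    return texts[0]


def prefix_or_join(texts, sep="/"):
    """Use the common prefix if there is one, otherwise join everything."""
    return os.path.commonprefix(texts) or sep.join(texts)


print(first_text(["String1", "String2"]))      # String1
print(prefix_or_join(["String1", "String2"]))  # String
print(prefix_or_join(["Text2", "text3"]))      # Text2/text3
```

Either helper can be passed as the second argument, e.g. combine_elements(doc, prefix_or_join).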
How do you get a single string from all the inputs: take the first one, or the common part? – Mark 2013-05-01 10:38:43