2013-04-30

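Combining the values of similar tags into one tag

I have been trying to merge the values of similar tags into a single tag, as shown below.

XML input: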

<root> 
    <data> 
     <slide name="file.xml"> 
      <subtitle>Text1</subtitle> 
      <MainTitle>Text2</MainTitle> 
      <MainTitle>text3</MainTitle> 
     </slide> 
     <slide name="file.xml"> 
      <Title>String1</Title> 
      <Title>String2</Title> 
      <Title>String3</Title> 
      <Title>String4</Title> 
      <Title>String5</Title> 
      <Title>String6</Title> 
      <Title>String7</Title> 
      <Title>String8</Title> 
     </slide> 
    </data> 
</root> 

Expected output:

<root> 
    <data> 
     <slide name="file.xml"> 
      <subtitle>Text1</subtitle> 
      <MainTitle>Text2</MainTitle> 
      <MainTitle>text3</MainTitle> 
     </slide> 
     <slide name="file.xml"> 
      <Title>String</Title> 
     </slide> 
    </data> 
</root> 

Any help would be greatly appreciated. Thanks!

How should the single "String" value be derived from all the inputs: take the first one, or the part they have in common? – Mark 2013-05-01 10:38:43

Answer

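You need to group runs of identical tags recursively. Here is an implementation that lets you pass in a function deciding how the text of each run is combined: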

#!/usr/bin/env python 
# -*- coding: utf-8 -*- 
import itertools 
import operator 
import os.path 

from lxml import etree 


text = """ 
<root> 
    <data> 
     <slide name="file.xml"> 
      <subtitle>Text1</subtitle> 
      <MainTitle>Text2</MainTitle> 
      <MainTitle>text3</MainTitle> 
     </slide> 
     <slide name="file.xml"> 
      <Title>String1</Title> 
      <Title>String2</Title> 
      <Title>String3</Title> 
      <Title>String4</Title> 
      <Title>String5</Title> 
      <Title>String6</Title> 
      <Title>String7</Title> 
      <Title>String8</Title> 
     </slide> 
    </data> 
</root> 
""" 


def combine_elements(elements, combine_text=', '.join):
    """Merge runs of adjacent same-tag children, then recurse into the rest."""
    result = []
    # groupby only groups consecutive children that share a tag
    for key, group in itertools.groupby(elements, operator.attrgetter('tag')):
        items = list(group)
        first_item = items[0]
        # combine only if the first item of the run has no children
        if len(items) > 1 and not len(first_item):
            # el.text is None for empty elements, so fall back to ''
            combined = combine_text([el.text or '' for el in items])
            # and only if combine_text returned something, e.g. the
            # strings share a common prefix
            if combined:
                first_item.text = combined
                result.append(first_item)
                continue
        result.extend(items)
    elements[:] = result
    # recursively combine the remaining children
    for element in elements:
        combine_elements(element, combine_text)


doc = etree.fromstring(text) 
combine_elements(doc, os.path.commonprefix) 
print(etree.tostring(doc, encoding='unicode'))
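
One thing to keep in mind about the grouping step: itertools.groupby starts a new group whenever the key changes, so only adjacent elements with the same tag end up in one run. A small sketch with a made-up list of tag names (the list is only an illustration, not data from the question):

```python
import itertools

# Hypothetical sequence of child tags; the last "Title" is separated
# from the first two by a different tag.
tags = ["Title", "Title", "subtitle", "Title"]

# Count how many items land in each consecutive run.
runs = [(tag, len(list(group))) for tag, group in itertools.groupby(tags)]
print(runs)
```

If same-named tags can be interleaved with other tags in your data, they would stay separate unless the children are sorted or collected by tag first.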

With os.path.commonprefix() as the text-combining function, you get the following result:

<root> 
    <data> 
     <slide name="file.xml"> 
      <subtitle>Text1</subtitle> 
      <MainTitle>Text2</MainTitle> 
      <MainTitle>text3</MainTitle> 
     </slide> 
     <slide name="file.xml"> 
      <Title>String</Title> 
      </slide> 
    </data> 
</root> 
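
The two MainTitle elements are left untouched in this output because their texts share no leading characters, so os.path.commonprefix returns an empty string and the function skips the merge. A quick check of that behaviour, using the same strings as the sample XML:

```python
import os.path

# Texts taken from the sample document in the question.
titles = ["String%d" % i for i in range(1, 9)]   # String1 .. String8
main_titles = ["Text2", "text3"]

# commonprefix compares character by character from the start.
title_prefix = os.path.commonprefix(titles)
main_prefix = os.path.commonprefix(main_titles)  # "T" != "t", so nothing is shared

print(repr(title_prefix))
print(repr(main_prefix))
```

The comparison is case-sensitive, which is why "Text2" and "text3" have nothing in common here.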

If you would rather join all the texts with a slash (/), for example, you can use the following:

doc = etree.fromstring(text) 
combine_elements(doc, '/'.join) 

Result:

<root> 
    <data> 
     <slide name="file.xml"> 
      <subtitle>Text1</subtitle> 
      <MainTitle>Text2/text3</MainTitle> 
      </slide> 
     <slide name="file.xml"> 
      <Title>String1/String2/String3/String4/String5/String6/String7/String8</Title> 
      </slide> 
    </data> 
</root> 
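
Any callable that takes a list of strings and returns a string can be passed as combine_text, so the "take the first one" option raised in the comment is also possible. The following is only a sketch: first_text is a hypothetical helper, and the grouping function is re-implemented on the standard library's xml.etree.ElementTree (an assumption, made so the snippet runs without lxml) rather than being the code above verbatim.

```python
import itertools
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml (assumption)


def first_text(texts):
    """Hypothetical combiner: keep the text of the first element in the run."""
    return texts[0]


def combine_elements(parent, combine_text):
    # Same grouping idea as in the answer, ported to ElementTree.
    result = []
    for _tag, group in itertools.groupby(parent, key=lambda el: el.tag):
        items = list(group)
        first = items[0]
        # Merge only runs of childless elements.
        if len(items) > 1 and len(first) == 0:
            combined = combine_text([el.text or "" for el in items])
            if combined:
                first.text = combined
                result.append(first)
                continue
        result.extend(items)
    # Replace the children with the merged list, then recurse.
    parent[:] = result
    for child in parent:
        combine_elements(child, combine_text)


doc = ET.fromstring(
    "<slide><Title>String1</Title><Title>String2</Title>"
    "<Title>String3</Title></slide>"
)
combine_elements(doc, first_text)
merged = ET.tostring(doc, encoding="unicode")
print(merged)
```

With this combiner the three Title elements collapse into a single Title holding "String1".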