How to extract tag offsets in an XML document with Python and BeautifulSoup


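I need some help finding the text offsets of certain tags in an XML document. I have a dataset in the format below, where the ROOT element contains multiple RECORD elements, but each RECORD contains exactly one TEXT element. Within the text there may be several TAG elements that serve as annotations of parts of the text. Using Python, I need to convert these annotations into another format that requires the begin and end offsets of each tag.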

<ROOT> 
    <RECORD ID="123"> 
     <TEXT> 
     This is an example text written at <TAG TYPE="DATE">December 29th</TAG> to illustrate the problem. 
     </TEXT> 
    </RECORD> 
</ROOT>

Basically, I want to convert the format above into the following:

<ROOT> 
    <RECORD ID="123"> 
     <TEXT> 
     This is an example text written at December 29th to illustrate the problem. 
     </TEXT> 
     <TAG TYPE="DATE" BEGIN=36 END=49/> 
    </RECORD> 
</ROOT>

I have been trying with BeautifulSoup, but cannot find a way to extract the tag offsets. Any ideas?

Thanks for your help!


Source

2014-12-29 jaxah

Why is this being downvoted? Edited. – ShreevatsaR

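The idea is to iterate over all TEXT nodes, find all TAG nodes inside each one, get the position of every TAG's text within the TEXT's text, create a new tag at the RECORD level, and then unwrap() the TAG from the TEXT: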

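from bs4 import BeautifulSoup

data = """
<ROOT>
    <RECORD ID="123">
     <TEXT>
This is an example text written at <TAG TYPE="DATE">December 29th</TAG> to illustrate the problem.
     </TEXT>
    </RECORD>
</ROOT>
"""

# The "xml" parser requires lxml to be installed
soup = BeautifulSoup(data, "xml")

for text in soup.find_all('TEXT'):
    record = text.parent
    for tag in text.find_all('TAG'):
        # Offset of the first occurrence of the tag's text within TEXT
        begin = text.text.index(tag.text)
        end = len(tag.text) + begin

        # Add an empty TAG carrying the offsets at the RECORD level
        # (other attributes, such as TYPE, are not copied here)
        record.append(soup.new_tag(tag.name, BEGIN=begin, END=end))

        # Remove the TAG markup but keep its text inside TEXT
        tag.unwrap()

print(soup)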

Prints:

<?xml version="1.0" encoding="utf-8"?> 
<ROOT> 
<RECORD ID="123"> 
<TEXT> 
This is an example text written at December 29th to illustrate the problem. 
     </TEXT> 
<TAG BEGIN="36" END="49"/></RECORD> 
</ROOT>

Note: I have not tested this for the case where multiple TAGs appear at the TEXT level. But it should at least give you a starting point.

Source

2014-12-29 11:03:29 alecxe

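Thanks for the answer. There is still a problem if several tags have the same content, but I will work that out. – jaxah

This is a pretty clever hack. Thanks for the idea. – ShreevatsaR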
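As the comment above points out, searching for the tag's text with index() always returns the first occurrence, so two TAGs with the same content would get the same offsets. One possible way around this, sketched here with the standard library's xml.etree.ElementTree rather than BeautifulSoup, is to walk the children of TEXT in document order and keep a running character count. The function name extract_offsets and the sample string are made up for illustration, and the sketch assumes that annotations are direct children of TEXT and are not nested. Offsets are counted from the start of the raw TEXT content, including any leading whitespace.

```python
import xml.etree.ElementTree as ET


def extract_offsets(xml_string):
    """Move inline annotations out of TEXT and record their offsets.

    Hypothetical helper: assumes every child element of TEXT is an
    annotation and that annotations are not nested.
    """
    root = ET.fromstring(xml_string)
    for record in root.iter("RECORD"):
        text_elem = record.find("TEXT")
        if text_elem is None:
            continue

        # Text that precedes the first annotation
        pieces = [text_elem.text or ""]
        position = len(pieces[0])
        annotations = []

        for tag in list(text_elem):
            inner = "".join(tag.itertext())
            begin, end = position, position + len(inner)
            # Keep the original attributes (e.g. TYPE) and add the offsets
            attrs = dict(tag.attrib, BEGIN=str(begin), END=str(end))
            annotations.append((tag.tag, attrs))

            tail = tag.tail or ""
            pieces.extend([inner, tail])
            position = end + len(tail)
            text_elem.remove(tag)

        # TEXT now holds plain text only
        text_elem.text = "".join(pieces)

        # Append one empty element per annotation at the RECORD level
        for name, attrs in annotations:
            ET.SubElement(record, name, attrs)
    return root


sample = (
    '<ROOT><RECORD ID="1"><TEXT>From <TAG TYPE="DATE">May 1</TAG> until '
    '<TAG TYPE="DATE">May 1</TAG>.</TEXT></RECORD></ROOT>'
)
result = extract_offsets(sample)
print(ET.tostring(result, encoding="unicode"))
```

Because the position is accumulated while walking the tree, each annotation gets its own offsets even when the annotated strings are identical, and slicing the resulting TEXT content with BEGIN and END gives back the annotated string.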

Using lxml.etree:

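from lxml import etree

# `data` is the XML string defined in the BeautifulSoup answer
root = etree.fromstring(data)

for text_elem in root.iter("TEXT"):
    record = text_elem.getparent()
    # Remember each TAG's attributes and text before its markup is removed
    tags = [(dict(t.attrib), t.text.strip()) for t in text_elem.iter("TAG")]
    etree.strip_tags(text_elem, "TAG")
    # Offsets are relative to the whitespace-stripped text
    stripped = text_elem.text.strip()
    position = record.index(text_elem) + 1
    for attrib, tag_text in tags:
        begin = stripped.find(tag_text)
        # Use a new element for each annotation; inserting the same
        # element twice would only move it
        new_tag = etree.Element("TAG", attrib)
        new_tag.set("BEGIN", str(begin))
        new_tag.set("END", str(begin + len(tag_text)))
        record.insert(position, new_tag)
        position += 1

print(etree.tostring(root).decode())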

<ROOT> 
    <RECORD ID="123"> 
     <TEXT> 
     This is an example text written at December 29th to illustrate the problem. 
     </TEXT> 
    <TAG TYPE="DATE" BEGIN="35" END="48"/></RECORD> 
</ROOT>
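The two answers report different offsets for the same annotation (36/49 versus 35/48). A likely explanation is that the BeautifulSoup answer searches the raw TEXT content, which starts with a newline, while the lxml answer calls strip() on the text first. The small check below illustrates the difference. It assumes the TEXT content begins with exactly one newline, as in the data string of the BeautifulSoup answer, and the variable names are made up for illustration.

```python
# Raw TEXT content with a single leading newline (an assumption, see above)
raw = "\nThis is an example text written at December 29th to illustrate the problem.\n"
needle = "December 29th"

# Offset in the raw text, as in the BeautifulSoup answer
begin_raw = raw.index(needle)
end_raw = begin_raw + len(needle)

# Offset in the stripped text, as in the lxml answer
stripped = raw.strip()
begin_stripped = stripped.index(needle)
end_stripped = begin_stripped + len(needle)

print(begin_raw, end_raw)
print(begin_stripped, end_stripped)
```

Whichever convention is used, the offsets only make sense against the same version of the text, so the consumer of the converted format needs to know whether leading whitespace is counted.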

Source

2014-12-29 11:03:12

...to get the begin and end values of the TAG tags. –
