String matching in Python

I am having trouble with the following matching problem. Let's say I have two lists of strings in a dictionary:

left        right 
british        7 
cuneate nucleus      Medulla oblongata 
Motoneurons       anterior

And I have a file with some test lines like the following:

<s id="69-7">British Meanwhile is the studio 7 album by british pop band 10cc 7.</s> 
<s id="5239778-2">Medulla oblongata,the name refers collectively to the cuneate nucleus and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s> 
<s id="21120-99">Terior horn cells, motoneurons located in the spinal.</s>

I want the output to look like this:

<s id="69-7"><w2>British</w2> Meanwhile is the studio <w2>7</w2> album by <w1>british</w1> pop band 10cc <w2>7</w2>.</s> 
<s id="5239778-2"><w2>Medulla oblongata</w2>,the name refers collectively to the <w1>cuneate nucleus</w1> and gracile nucleus, which are present at the junction between the spinal cord and the <w2>medulla oblongata</w2>.</s>

I tried the following code:

import re

def textReturn(left, right):
    text = ""
    filetext = open("text.xml", "r").read()
    linelist = re.split(u'[\n|\r\n]+', filetext)

    for i in linelist:
        left = left.strip()
        right = right.strip()

        if left in i and right in i:
            i1 = re.sub('(?i)(\s+)(%s)(\s+)' % left, '\\1<w1>\\2</w1>\\3', i)
            i2 = re.sub('(?i)(\s+)(%s)(\s+)' % right, '\\1<w2>\\2</w2>\\3', i1)
            text = text + i2 + "\n"
    return text

But it gives me:

'<s id="69-7">British meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc 7.</s>'. 
<s id="5239778-2">Medulla oblongata,the name refers collectively to the <w1>cuneate nucleus</w1> and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s> 
<s id="21120-99">Terior horn cells, <w1>motoneurons</w2> located in the spinal.</s>

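That is, a keyword at the beginning or end of a line does not get tagged.

Also, I only want to return the lines that match both the left and the right string, not the other lines.

Any solution, please! Thanks a lot!!!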

Source

2011-07-31 Liza

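The input looks like XML. Are you sure you don't need an XML parser to pull out the strings? Also, REs really should use raw strings (r'...'), because raw strings don't treat backslashes specially. – Keith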
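To illustrate the raw-string point, here is a minimal sketch. The variable names are made up for the demonstration. In a normal string literal, "\b" is the backspace character, so the regex engine never sees a word-boundary assertion.

```python
import re

# Normal literal: "\b" becomes a single backspace character (U+0008).
plain = "\bbritish\b"
# Raw literal: the backslash is kept, so the regex engine sees \b.
raw = r"\bbritish\b"

text = "album by british pop band"

print(len(plain), len(raw))           # 9 11
print(re.search(plain, text))         # None (the text has no backspace characters)
print(re.search(raw, text).group())   # british
```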

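Keith has a good point. Relying on the whole <s> element being on one line is probably not a good idea. Once you consider literal strings, CDATA sections, processing instructions and so on, you could find the elements yourself, but why bother when an XML parser already does that work for you? There is a learning curve to using parsers, as well as XSLT (for modifying the document the way you want), but it is well worth it! –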

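It doesn't tag at the beginning and end because your pattern expects one or more whitespace characters before and after the keyword. You can use \b (word boundary) instead.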
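A small sketch of the difference, using the content of the first sample line. Using re.escape and flags=re.IGNORECASE is my own choice here, not something the question's code does.

```python
import re

line = "British Meanwhile is the studio 7 album by british pop band 10cc 7."
keyword = re.escape("british")  # escape in case a keyword contains regex metacharacters

# Whitespace required on both sides: the keyword at the very start is skipped.
spaced = re.sub(r"(\s+)(%s)(\s+)" % keyword, r"\1<w1>\2</w1>\3", line,
                flags=re.IGNORECASE)

# \b matches at any word boundary, including the start of the string.
bounded = re.sub(r"\b(%s)\b" % keyword, r"<w1>\1</w1>", line,
                 flags=re.IGNORECASE)

print(spaced)
print(bounded)
```

With the \s+ pattern only the second "british" is wrapped. With \b both occurrences are wrapped, including the one at the start of the line.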

Addendum

Actual code:

import re

# (left, right) keyword pairs; renamed from `dict` to avoid shadowing the built-in
pairs = [('british', '7'), ('cuneate nucleus', 'Medulla oblongata'), ('Motoneurons', 'anterior')]

filetext = """<s id="69-7">British Meanwhile is the studio 7 album by british pop band 10cc 7.</s>
<s id="5239778-2">Medulla oblongata,the name refers collectively to the cuneate nucleus and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>
<s id="21120-99">Terior horn cells, motoneurons located in the spinal.</s>
"""

# splitlines() handles both \n and \r\n; blank lines are skipped
linelist = [line for line in filetext.splitlines() if line.strip()]

# Split each line into opening tag, content, and closing tag
s_tag = re.compile(r"(<s[^>]+>)(.*?)(</s>)")

# Pair i is checked against line i, as in the sample data
for (left, right), line in zip(pairs, linelist):
    line_parts = s_tag.search(line)
    start, content, end = line_parts.groups()

    # \b matches at word boundaries, including the start and end of the content
    left_match = r"(?i)\b(%s)\b" % re.escape(left)
    right_match = r"(?i)\b(%s)\b" % re.escape(right)
    # Only print lines that contain both keywords
    if re.search(left_match, content) and re.search(right_match, content):
        line1 = re.sub(left_match, r'<w1>\1</w1>', content)
        line2 = re.sub(right_match, r'<w2>\1</w2>', line1)
        print(start + line2 + end)

This is the basis of a short-term solution, but in the long run you should try the XML parser approach.

Source

2011-07-31 18:21:47

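The problem still persists! – Liza

OK, I'll get it working for you... I'll also add the 'r' prefix to some of the strings. Give me 10 minutes or so. –

Thanks in advance! – Liza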

If your input file is going to be an XML file, why not use an XML parser? See section 19.5 of the Python documentation, xml.parsers.expat (Fast XML parsing using Expat).
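A minimal sketch of what that could look like with xml.parsers.expat. It assumes the bare <s> elements are wrapped in a single root element (the <root> name is hypothetical) so the input is one well-formed document. The handler and variable names are made up for the illustration.

```python
import xml.parsers.expat

# Assumption: the <s> elements are wrapped in a hypothetical <root> element.
data = """<root>
<s id="69-7">British Meanwhile is the studio 7 album by british pop band 10cc 7.</s>
<s id="21120-99">Terior horn cells, motoneurons located in the spinal.</s>
</root>"""

sentences = []             # collected (id, text) pairs
state = {"current": None}  # [id, text chunks] while inside an <s> element

def start_element(name, attrs):
    if name == "s":
        state["current"] = [attrs.get("id"), []]

def char_data(text):
    # expat may deliver the text of one element in several chunks
    if state["current"] is not None:
        state["current"][1].append(text)

def end_element(name):
    if name == "s" and state["current"] is not None:
        sid, chunks = state["current"]
        sentences.append((sid, "".join(chunks)))
        state["current"] = None

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.CharacterDataHandler = char_data
parser.EndElementHandler = end_element
parser.Parse(data, True)

for sid, text in sentences:
    print(sid, text)
```

The parser hands back the id and the text of each sentence regardless of how the elements are split across lines, so the keyword tagging can then be applied to the text alone.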

Source

2011-07-31 18:22:29 yasouser
