I'm having trouble with string matching in Python. I have a dictionary that pairs some left strings with right strings:
left             right
british          7
cuneate nucleus  Medulla oblongata
Motoneurons      anterior
I also have a file with test lines like these:
<s id="69-7">British Meanwhile is the studio 7 album by british pop band 10cc 7.</s>
<s id="5239778-2">Medulla oblongata,the name refers collectively to the cuneate nucleus and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>
<s id="21120-99">Terior horn cells, motoneurons located in the spinal.</s>
I want output like this:
<s id="69-7"><w2>British</w2> Meanwhile is the studio <w2>7</w2> album by <w1>british</w1> pop band 10cc <w2>7</w2>.</s>
<s id="5239778-2"><w2>Medulla oblongata</w2>,the name refers collectively to the <w1>cuneate nucleus</w1> and gracile nucleus, which are present at the junction between the spinal cord and the <w2>medulla oblongata</w2>.</s>
I tried the following code:
import re

def textReturn(left, right):
    text = ""
    filetext = open("text.xml", "r").read()
    linelist = re.split(u'[\n|\r\n]+', filetext)
    for i in linelist:
        left = left.strip()
        right = right.strip()
        if left in i and right in i:
            # (\s+) on both sides means a match needs whitespace before and after it
            i1 = re.sub('(?i)(\s+)(%s)(\s+)' % left, '\\1<w1>\\2</w1>\\3', i)
            i2 = re.sub('(?i)(\s+)(%s)(\s+)' % right, '\\1<w2>\\2</w2>\\3', i1)
            text = text + i2 + "\n"
    return text
But it gives me:
<s id="69-7">British meanwhile is the studio <w2>7</w2> album by <w1>British</w1> pop band 10cc 7.</s>
<s id="5239778-2">Medulla oblongata,the name refers collectively to the <w1>cuneate nucleus</w1> and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>
<s id="21120-99">Terior horn cells, <w1>motoneurons</w1> located in the spinal.</s>
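That is, a string at the very beginning or end of a line does not get tagged.
Also, I only want to return the lines in which both the left and the right string match, not the other lines.
Any solution, please! Thanks a lot!!!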
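One possible direction, as a minimal sketch rather than a definitive answer: replace the `(\s+)` groups with the lookarounds `(?<!\w)` and `(?!\w)`, so that a match at the start or end of the text, or next to punctuation, still counts. The names `tag_line` and `SENTENCE_RE` are hypothetical, and the sketch assumes each `<s>` element sits on one line and contains no nested tags. It substitutes only inside the element body, so the `7` in `id="69-7"` is left alone, and it returns None unless both strings occur. Note that it tags every case-insensitive occurrence of the left string as `<w1>`, so the leading "British" becomes `<w1>`, not `<w2>` as in the desired output above.

```python
import re

# Hypothetical helper: splits a line into opening <s ...> tag, body, closing tag.
SENTENCE_RE = re.compile(r'^(<s\b[^>]*>)(.*)(</s>)$')

def tag_line(line, left, right):
    """Wrap `left` in <w1> and `right` in <w2>; return None unless both occur."""
    m = SENTENCE_RE.match(line.strip())
    if m is None:
        return None
    open_tag, body, close_tag = m.groups()

    # (?<!\w) and (?!\w) assert "no word character here" without consuming
    # anything, so they also succeed at the start and end of the body.
    pattern = re.compile(
        r'(?<!\w)(?:(?P<w1>%s)|(?P<w2>%s))(?!\w)'
        % (re.escape(left.strip()), re.escape(right.strip())),
        re.IGNORECASE,
    )

    seen = set()

    def wrap(match):
        name = match.lastgroup  # 'w1' or 'w2', whichever alternative matched
        seen.add(name)
        return '<%s>%s</%s>' % (name, match.group(), name)

    tagged = pattern.sub(wrap, body)
    if seen != {'w1', 'w2'}:
        return None  # one of the two strings is missing: skip this line
    return open_tag + tagged + close_tag
```

Calling `tag_line` once per line and keeping only the results that are not None would give just the lines where both strings match.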
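The input looks like XML. Are you sure you don't need an XML parser to pull out the strings? Also, REs really should use raw strings (r'...'), because those don't treat backslashes specially. – Keith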
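To illustrate both points, here is a rough sketch using the standard library's `xml.etree.ElementTree`. The function name `iter_sentences` and the wrapping `<root>` element are assumptions: the sample lines have no single root element, so the sketch adds one before parsing.

```python
import xml.etree.ElementTree as ET

# In a normal string '\b' is one backspace character; in a raw string it stays
# as backslash + 'b', which the re module reads as a word boundary.
assert len('\b') == 1 and len(r'\b') == 2

def iter_sentences(xml_text):
    """Yield (id, text) for every <s> element in xml_text."""
    # Assumption: the file is a bare sequence of <s> elements, so wrap it in
    # a hypothetical <root> to make it a well-formed document.
    root = ET.fromstring('<root>%s</root>' % xml_text)
    for s in root.iter('s'):
        # itertext() also collects text inside any nested child elements.
        yield s.get('id'), ''.join(s.itertext())
```

The parser finds the elements regardless of how they are split across lines, so the matching code only ever sees sentence text.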
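Keith has a good point. Relying on the whole <s> element being on a single line is probably not a good idea. Once you take literal strings, CDATA sections, processing instructions and so on into account, you could find the elements yourself, but why, when an XML parser already does that work for you? There is a learning curve to using them, as well as XSLT (for modifying the document the way you want), but it is so worth it! –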
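As a hedged illustration of letting the parser modify the document, the sketch below uses Python's `xml.etree.ElementTree` rather than XSLT. The function name `wrap_matches` is hypothetical. It assumes the element contains only text (no child elements yet) and that the compiled pattern uses named groups whose names are the tags to create.

```python
import re
import xml.etree.ElementTree as ET

def wrap_matches(elem, pattern):
    """Turn each regex match in elem's text into a child element.

    Assumes elem has no children and that every alternative in `pattern`
    is a named group; the group name becomes the new tag name.
    """
    text = elem.text or ''
    last = None   # most recently created child element
    pos = 0       # end of the previous match
    for m in pattern.finditer(text):
        chunk = text[pos:m.start()]
        if last is None:
            elem.text = chunk   # text before the first child
        else:
            last.tail = chunk   # text between two children
        last = ET.SubElement(elem, m.lastgroup)
        last.text = m.group()
        pos = m.end()
    if last is not None:
        last.tail = text[pos:]  # text after the last child
```

A short usage example under the same assumptions:

```python
s = ET.fromstring('<s id="1">Medulla oblongata,the cuneate nucleus.</s>')
pattern = re.compile(
    r'(?<!\w)(?:(?P<w1>cuneate nucleus)|(?P<w2>medulla oblongata))(?!\w)',
    re.IGNORECASE,
)
wrap_matches(s, pattern)
print(ET.tostring(s, encoding='unicode'))
```

Because the tags are created as elements, not pasted in as strings, the serialized result stays well-formed.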