Retrieving specific text from Freebase/WEX with Python re

Let's say I have text like this:

This is something before any tag, today's date is 09-06-2012 blah blah 
<firsttag> content of first tag </firsttag> <sentence> This is the 
first sentence in my paragraph that needs to be <bold> displayed. 
</bold> </sentence> <secondtag> blah blah blah <italics> another blah 
</italics></secondtag> <sentence> This is the second sentence in my 
paragraph that needs to be displayed and it has some weird contents 
like \n\n\n and inbetween reference tags like <link> http://google.com 
</link></sentence> <thirdtag>blah blah </thirdtag><sentence>Tennis is 
a great sport, I'm really sad that <link 
synthetic="True"><target>Roger Federer </link></target>Roger Federer 
lost yesterday.</sentence>

The output string should look like this:

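This is the first sentence in my paragraph that needs to be displayed. This is the second sentence in my paragraph that needs to be displayed and it has some weird contents like and inbetween reference tags like Tennis is a great sport, I'm really sad that Roger Federer lost yesterday.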

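After regex parsing, my output should contain only the content between the <sentence> and </sentence> tags. All tags, the weird \n\n characters, and all junk content need to be removed. That includes the "Roger Federer" link, which merely points to Roger Federer's page, because this is the Freebase-wiki (WEX) dataset I am working with. A simple Python re snippet to help me solve this would be very useful. The code I am trying is this: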

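import fileinput
import re

for line in fileinput.input():
    p = re.sub('<[^>]*>', '', line)  # strip every tag, keeping the text around it
    p = re.sub('\n', '', p)          # strip newline characters
    print(p)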

Since I am dealing with huge files, it would also be very helpful if you could help me with map-reduce (Hadoop) code. Thanks in advance :)

Source

2012-09-06 crazyim5

I hacked together a custom solution for your problem. You have to pass your string in as the parameter s.

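import re

def convert_with_regex(s):
    # Capture the body of every <sentence> element; re.S lets "." match newlines.
    sents = re.compile(r"<sentence>(.*?)</sentence>", re.S)
    # Remove tagged elements together with their content, and runs of newlines.
    fin = re.compile(r"<(.*)>(.*?)</.*>|[\n]+", re.S)
    result = []
    # Strip the <bold> tags first so the text they wrap is kept.
    for sent in sents.findall(s.replace("<bold>", "").replace("</bold>", "")):
        result.append(fin.sub("", sent))
    return ''.join(result)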

I know this isn't elegant, but "form follows function" :)
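For what it's worth, here is a slightly tighter variant of the same idea, as a rough sketch rather than a definitive fix. It relies on a few assumptions that the question does not confirm: every <link> element is junk and should be dropped together with its content, every other tag (such as <bold>) should be dropped while its text is kept, the "\n" noise may be either a real newline or a literal backslash followed by n, and each record fits on one input line. The names extract_sentences and map_lines are made up for this example.

```python
import re

# Body of each <sentence>...</sentence>; DOTALL lets "." span newlines.
SENTENCE_RE = re.compile(r"<sentence>(.*?)</sentence>", re.DOTALL)
# A whole <link ...>...</link> element, plus a stray </target> right after it
# (the sample data closes </link> before </target>).
LINK_RE = re.compile(r"<link\b[^>]*>.*?</link>(?:\s*</target>)?", re.DOTALL)
# Any remaining tag; only the tag itself is removed, not the text it wraps.
TAG_RE = re.compile(r"<[^>]+>")


def extract_sentences(text):
    """Return the cleaned text of all <sentence> elements, joined by spaces."""
    cleaned = []
    for body in SENTENCE_RE.findall(text):
        body = LINK_RE.sub(" ", body)    # assumption: link content is junk
        body = TAG_RE.sub(" ", body)     # e.g. <bold>: drop tag, keep text
        body = body.replace("\\n", " ")  # literal backslash-n sequences
        body = " ".join(body.split())    # collapse newlines and extra spaces
        if body:
            cleaned.append(body)
    return " ".join(cleaned)


def map_lines(lines):
    """Yield one cleaned string per input line that contains a sentence."""
    for line in lines:
        out = extract_sentences(line)
        if out:
            yield out
```

For the Hadoop part of the question: Hadoop Streaming feeds input lines to a mapper on stdin and collects whatever it writes to stdout, so a mapper script could loop over map_lines(sys.stdin) and print each result. That only works if a <sentence> element never spans more than one input line, which I can't tell from the question.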

Source

2012-09-06 20:38:47 halex
