Regular expression in Python to remove XML comments and HTML elements

I am parsing RSS content using Universal Feed Parser. In the description tag I sometimes get values like the ones below:

<!--This is the XML comment --> 
<p>This is a Test Paragraph</p></br> 
<b>Sample Bold</b> 
<m:Table>Sampe Text</m:Table>

In order to remove the HTML elements/tags I am using the following regular expression.

import re

pattern = re.compile(r'<\/?\w+\s*[^>]*?\/?>', re.DOTALL | re.MULTILINE | re.IGNORECASE | re.UNICODE)
desc = pattern.sub(u" ", desc)

This helps remove the HTML tags, but not the XML comments. How can I remove both the elements and the XML comments?
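A quick check of why the comment survives, as a minimal sketch: the pattern requires a word character right after `<` (or `</`), and `<!--` starts with `!`. The sample string and the variable name `stripped` are made up for illustration.

```python
import re

# The pattern from the question, written as a raw string.
pattern = re.compile(r'</?\w+\s*[^>]*?/?>', re.DOTALL | re.UNICODE)

desc = '<!--This is the XML comment -->\n<p>This is a Test Paragraph</p>'

# '<' must be followed by a word character (optionally after '/'),
# so '<!--' never matches and the comment is left in place.
stripped = pattern.sub(' ', desc)
print(stripped)
```

The `<p>` and `</p>` tags are replaced by spaces, while the comment is printed unchanged.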

2011-10-12 Simsons

Isn't this enough? r'<.*?>' – rplnt

The right way is to use an XML parser, as @duffymo said. Try BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) – WilHall

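A parser is overkill in this case. You don't need to know the tree structure, tag namespaces, names and attributes just to throw them away, do you? Oh, and @rplnt, you forgot CDATA ('<![CDATA[some text <this is not a tag!> some more text]]>'). – pyos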

Using lxml:

import lxml.html as LH 

content=''' 
<!--This is the XML comment --> 
<p>This is a Test Paragraph</p></br> 
<b>Sample Bold</b> 
<Table>Sampe Text</Table> 
''' 

doc=LH.fromstring(content) 
print(doc.text_content())

yields

This is a Test Paragraph 
Sample Bold 
Sampe Text

2011-10-12 12:07:52 unutbu

+1 for not using regex! – naeg

Using regular expressions this way is a bad idea.

I would use a real parser, then navigate the DOM tree and remove what I want that way.
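As a rough illustration of the parser route, here is a sketch using the standard library's `html.parser.HTMLParser`. This is an event-based parser rather than a DOM tree, so it is a swapped-in technique, not the approach described above; the names `TextExtractor`, `strip_markup`, `sample` and `result` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes; tags and comments are simply not recorded."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags, never for the tags or comments themselves.
        self.parts.append(data)


def strip_markup(text):
    parser = TextExtractor()
    parser.feed(text)
    parser.close()
    return ''.join(parser.parts)


sample = ('<!--This is the XML comment -->\n'
          '<p>This is a Test Paragraph</p></br>\n'
          '<b>Sample Bold</b>\n'
          '<m:Table>Sampe Text</m:Table>')
result = strip_markup(sample)
print(result)
```

Because comments are delivered to a separate callback (`handle_comment`, which does nothing by default), they drop out without any special handling.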

2011-10-12 11:46:05 duffymo

As per the accepted answer here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags. Use Beautiful Soup instead. –

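You guys from the Ban Regex movement really scare me. Regular expressions can't be used to **PARSE** XML, because tags can be nested (''), but they can be used to **STRIP** tags, because a tag is just anything between angle brackets. Read Wikipedia, damn it. (Sorry) – pyos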

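There is no movement to ban regex; the point is just that each task should be done with the right tool. Before you strip a tag you have to find it, and how are you going to do that? With a regex? Bad idea. –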

Why so complicated? re.sub('<!\[CDATA\[(.*?)\]\]>|<.*?>', lambda m: m.group(1) or '', desc, flags=re.DOTALL)
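A small runnable sketch of that one-liner, assuming a made-up sample string; `STRIP_RE`, `strip_tags`, `sample` and `result` are illustrative names only.

```python
import re

# Same pattern as the one-liner above; group 1 is only set for CDATA sections.
STRIP_RE = re.compile(r'<!\[CDATA\[(.*?)\]\]>|<.*?>', re.DOTALL)


def strip_tags(desc):
    # Keep the CDATA payload, drop everything else between angle brackets.
    return STRIP_RE.sub(lambda m: m.group(1) or '', desc)


sample = ('<!--This is the XML comment -->'
          '<p>This is a Test Paragraph</p>'
          '<![CDATA[some text <this is not a tag!> more text]]>')
result = strip_tags(sample)
print(result)
```

The comment and the `<p>` tags disappear, while the text inside the CDATA section, including its `<this is not a tag!>` part, is kept. One caveat: a comment that itself contains a `>` would be cut off at that character, since `<.*?>` stops at the first `>`.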

If you want XML tags to remain intact, then you should check out a list of HTML tags at http://www.whatwg.org/specs/web-apps/current-work/multipage/ and use the '(<!\[CDATA\[.*?\]\]>)||</?(?:tag names separated by pipes)(?:\s.*?)?>' regex.
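A sketch of the whitelist idea under a few assumptions: `HTML_TAGS` is a deliberately short, hypothetical list rather than the full list from the spec, and the empty alternative (`||`) in the pattern as quoted is filled here with a comment pattern, which is a guess and not something the text states.

```python
import re

# Hypothetical, deliberately short whitelist of HTML tag names.
HTML_TAGS = ['p', 'b', 'br', 'i', 'a', 'div', 'span']
TAG_NAMES = '|'.join(re.escape(name) for name in HTML_TAGS)

pattern = re.compile(
    r'(<!\[CDATA\[.*?\]\]>)'                   # group 1: CDATA, kept verbatim
    r'|<!--.*?-->'                              # assumption: drop comments too
    r'|</?(?:' + TAG_NAMES + r')(?:\s.*?)?>',   # whitelisted HTML tags only
    re.DOTALL | re.IGNORECASE,
)


def strip_html_only(text):
    # CDATA sections are put back unchanged; everything else matched is removed.
    return pattern.sub(lambda m: m.group(1) or '', text)


sample = '<!--note--><p>Hello</p><m:Table>Sampe Text</m:Table><b>bold</b>'
result = strip_html_only(sample)
print(result)
```

Here `<m:Table>` survives because `m` is not in the whitelist, while `<p>` and `<b>` are removed.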

2011-10-12 11:50:03 pyos

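There is a simple way to do this in pure Python:

def remove_html_markup(s):
    tag = False    # True while we are between '<' and '>'
    quote = False  # True while we are inside a quoted attribute value
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            # Quotes only matter inside a tag; a '>' inside them is not the end.
            quote = not quote
        elif not tag:
            out = out + c

    return out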

The idea is explained here: http://youtu.be/2tu9LTDujbw

You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s
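A few sample calls, as a sketch; the function is repeated so the snippet runs on its own, and the input strings and the names `plain`, `quoted` and `apostrophe` are made up for illustration.

```python
def remove_html_markup(s):
    tag = False
    quote = False
    out = ""
    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c
    return out


# Plain tags are removed.
plain = remove_html_markup('<b>Sample Bold</b>')
# A '>' inside a quoted attribute value does not end the tag.
quoted = remove_html_markup('<a title=">">link</a>')
# An unbalanced apostrophe inside a comment leaves the quote flag set.
apostrophe = remove_html_markup("<!-- don't --><p>text</p>")

print(repr(plain), repr(quoted), repr(apostrophe))
```

The last call shows a limitation worth knowing about: a lone apostrophe inside a comment flips the quote state, so everything after it is treated as part of the tag and dropped.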

PS - If you are interested in the class (about smart debugging with Python), here is a link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!

You're welcome!

2013-01-22 17:39:55 Medeiros
