Regex stripping HTML tags

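I have a ColdFusion script that has been running on a content management system for a while. It uses regular expressions to strip any bad tags and characters out of the content.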

I need to stop this script from stripping out any <object> tags and their markup.

I gave it a go, but I think this is beyond my regex skills.

<cfparam name="Attributes.allowedclasses" default=""> 

<!--- turn allowed classes list to regular expression ---> 
<cfset Attributes.allowedclasses = Replace(Attributes.allowedclasses, ",", "|", "all")> 

<cfset vBody="<body style='font-family:Verdana; font-size:12px;'>"> 
<cfset vStart="<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Transitional//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'><html xmlns='http://www.w3.org/1999/xhtml' lang='en' xml:lang='en'><head><title>Title</title></head>#vBody#"> 
<cfset vEnd="</body></html>"> 

<cfloop list="#Attributes.varnames#" index="theVariable"> 

    <cfset vIntVar=evaluate("caller.#theVariable#")> 

    <cf_bocctrimformvars varnames="vIntVar" allowhtml="yes" quotes="unescape" allowPound="yes"> 

    <cfset vIntVarDebug=vIntVar> 

    <!--- strip copy and paste word etc code formatting ---> 

    <cfset vIntVar=ReReplaceNoCase(vIntVar, "</?[a-z0-9-=""'!\$\?%&\*\[email protected]~##;,\\]*:[a-z0-9 -=""'!\$\?%&\*\[email protected]~##;,\\]*>", "", "all")> 

    <!--- stop certain classes being stripped out ---> 
    <cfif ListLen(Attributes.allowedclasses) NEQ 0> 
     <cfset vIntVar=ReReplaceNoCase(vIntVar, '<span class="(#Attributes.allowedclasses#)">([\s\S]*?)</span>', '<excludespan classexclude="\1">\2</excludespan>', 'all')> 

     <!--- stop other classes being stripped out ---> 
     <cfset vIntVar=ReReplaceNoCase(vIntVar, '<([a-z0-9]+) class="(#Attributes.allowedclasses#)"[^>]*>', '<\1 classexclude="\2">', 'all')> 
    </cfif> 

    <!--- strip out span and font tags ---> 
    <cfset vIntVar=ReReplaceNoCase(vIntVar, "</?(span|font)[^>]*>", "", "all")> 

    <!--- strip out rest of styles/classes ---> 
    <cfset vIntVar=ReReplaceNoCase(vIntVar, "<([a-z0-9]+) (style|class)=[^>]*>", "<\1>", "all")> 

    <!--- reset classes which shouldn't be stripped out ---> 
    <cfif ListLen(Attributes.allowedclasses) NEQ 0> 
     <cfset vIntVar=ReReplaceNoCase(vIntVar, '<excludespan classexclude="([a-z0-9-]+)"[^>]*>', '<span class="\1">', 'all')> 
     <cfset vIntVar=ReplaceNoCase(vIntVar, '</excludespan>', '</span>', 'all')> 

     <cfset vIntVar=ReReplaceNoCase(vIntVar, '<([a-z0-9]+) classexclude="([a-z0-9-]+)"[^>]*>', '<\1 class="\2">', 'all')> 
    </cfif> 



    <cfset vIntVar=ReReplaceNoCase(vIntVar, "<\?xml[^>]*>", "", "all")> 
    <cfset vIntVar=ReReplaceNoCase(vIntVar, "<p>([[:space:]])*</p>", "", "all")> 
    <cfset vIntVar=ReReplaceNoCase(vIntVar, "</?U>", "", "all")> 
    <cfset vIntVar=ReReplaceNoCase(vIntVar, "</?DIV[^>]*>", "", "all")> 
    <cfset vIntVar=ReReplaceNoCase(vIntVar, "</?PRE>", "", "all")> 
    <cfset vIntVar=ReplaceNoCase(vIntVar, 'target=""', '', 'all')> 

    <!--- 
    DG 19/9/2004: fix put in to swap round <p> and <a> tags if a single <p> is inside an <a> 
    (which html tidy doesn't like 
    ---> 
    <cfset vIntVar=ReReplaceNoCase(vIntVar, "<a([[:print:]]*)>[[:space:]]*<p>([[:print:]]*)</p>([[:space:]]*)</a>", "<p><a\1>\2</a></p>", 'all')> 

    <cfset vIntVar=vStart & vIntVar & vEnd> 

    <cflock name="tidy" type="exclusive" timeout="10"> 
     <cfscript> 
     TidyObj = CreateObject("COM", "TidyCOM.TidyObject"); 
     TidyOptions = TidyObj.Options; 
     TidyOptions.Doctype = "omit"; 
     TidyOptions.TidyMark = false; 
     TidyOptions.OutputXml = false; 
     TidyOptions.InputXml = false; 
     TidyOptions.OutputXhtml = true; 
     TidyOptions.ShowWarnings = false; 
     TidyOptions.DropEmptyParas = true; 
     TidyOptions.Quiet = true; 
     TidyOptions.Indent = 0; 
     TidyOptions.Wrap = 0; 
     TidyOptions.QuoteAmpersand = true; 

     vIntVar = TidyObj.TidyMemToMem(vIntVar); 

     TidyObj = ""; 
     </cfscript> 
    </cflock> 


    <!--- strip any image tags inserted by drag and drop etc ---> 
    <cfset vIntVar=ReReplaceNoCase(vIntVar, "<img [^>]*>", "", "all")> 


</cfloop>

Source

2010-11-25 Sam

I should warn you that questions about parsing HTML with regex tend to be frowned upon here - see this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – 2010-11-25 10:48:52

I can understand why it's frowned upon. This is an old script. Right now I just need a quick fix, not a rewrite. – Sam 2010-11-25 11:00:26

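I agree with orangepips that you should ask a more specific question, but I also like a challenge. I have tried parsing HTML with regex and can attest that it is not a good solution, especially when you are looking at a whole document rather than a simple string. Sometimes, though, you have to work within tight constraints and you don't have many options.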
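To make that concrete, here is a minimal sketch of one way a tag-stripping regex goes wrong. The pattern is the image-stripping one from the script in the question; the sample input and the variable names `sample` and `stripped` are made up for illustration.

```cfml
<!--- Illustrative input (not from the original script): an attribute value that contains ">" --->
<cfset sample = '<p>Chart: <img alt="a > b" src="chart.gif"></p>'>

<!--- Same pattern the script uses to strip image tags --->
<cfset stripped = ReReplaceNoCase(sample, "<img [^>]*>", "", "all")>

<!--- [^>]* stops at the first ">", so only part of the tag is removed --->
<cfoutput>#HTMLEditFormat(stripped)#</cfoutput>
```

The match ends at the `>` inside the `alt` value, so the tail of the tag (` b" src="chart.gif">`) is left behind in the content. A real HTML parser would not be fooled by this; a regex has no notion of quoted attribute values unless you add one yourself.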

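I went through all the regular expressions you are using here and ran every one of them against the object tag below. None of them matched the object tag, which leads me to believe the problem may be in TidyCOM. I poked around for information on TidyCOM, and the most recent material I could find dates from around 2001.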

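I know you just want to fix this script and move on, but that may not be possible. You may want to start thinking about migrating this legacy code to a newer platform.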

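If you want to pin down where the problem is, write the vIntVar variable out to a text file right after vStart, vIntVar and vEnd are concatenated. You could also use the CF debugger, of course, but as far as I remember it is not the easiest thing to work with.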
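A rough sketch of that debugging step, assuming it is dropped into the loop in the question's script. `vStart`, `vIntVar`, `vEnd` and `theVariable` come from that script; `vDebugBefore`, `vDebugAfter` and the file names are hypothetical, and the path should point somewhere the ColdFusion service is allowed to write.

```cfml
<cfset vIntVar=vStart & vIntVar & vEnd>

<!--- Debug only: snapshot of the content before it is handed to TidyCOM --->
<cfset vDebugBefore = ExpandPath("./debug_" & theVariable & "_before_tidy.html")>
<cffile action="write" file="#vDebugBefore#" output="#vIntVar#" addnewline="no">

<!--- ... existing cflock / TidyCOM block from the script runs here ... --->

<!--- Debug only: snapshot of the content after TidyCOM has processed it --->
<cfset vDebugAfter = ExpandPath("./debug_" & theVariable & "_after_tidy.html")>
<cffile action="write" file="#vDebugAfter#" output="#vIntVar#" addnewline="no">
```

If the object tag is present in the "before" file and missing from the "after" file, the regular expressions are off the hook and Tidy is the one removing it.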

The object tag I used to test the expressions:

<object classid="clsid:F08DF954-8592-11D1-B16A-00C0F0283628" id="Slider1" width="100" height="50"> 
    <param name="BorderStyle" value="1" /> 
    <param name="MousePointer" value="0" /> 
    <param name="Enabled" value="1" /> 
    <param name="Min" value="0" /> 
    <param name="Max" value="10" /> 
</object>
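A check of this kind could be sketched as follows. This is not the exact harness I used; `sampleTag`, `patterns` and `i` are illustrative names, the sample is shortened, and only a few of the patterns from the question are listed.

```cfml
<!--- Shortened copy of the sample object tag --->
<cfsavecontent variable="sampleTag"><object classid="clsid:F08DF954-8592-11D1-B16A-00C0F0283628" id="Slider1" width="100" height="50">
    <param name="BorderStyle" value="1" />
</object></cfsavecontent>

<!--- A few of the patterns from the script in the question --->
<cfset patterns = ArrayNew(1)>
<cfset ArrayAppend(patterns, "</?(span|font)[^>]*>")>
<cfset ArrayAppend(patterns, "<([a-z0-9]+) (style|class)=[^>]*>")>
<cfset ArrayAppend(patterns, "<\?xml[^>]*>")>
<cfset ArrayAppend(patterns, "</?DIV[^>]*>")>
<cfset ArrayAppend(patterns, "<img [^>]*>")>

<cfoutput>
<cfloop from="1" to="#ArrayLen(patterns)#" index="i">
    <!--- REFindNoCase returns the match position, or 0 when nothing matches --->
    #HTMLEditFormat(patterns[i])#: #REFindNoCase(patterns[i], sampleTag)#<br>
</cfloop>
</cfoutput>
```

A 0 next to a pattern means it does not touch the sample. Note that the style/class pattern does not fire on `classid`, because it requires `=` immediately after `class`.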

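If you need some help understanding what the regular expressions are doing, I have found Expresso to be a great tool. There are others, but this is the one I have used for years, and it gets the job done.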

Source

2010-11-26 01:30:14
