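Regular expression stripping HTML tags
I have a ColdFusion script that has been running on a content management system for a while. It uses regular expressions to strip any bad tags and characters from the content.
I need to stop this script from stripping any <object> and […] tags.
I gave it a go, but I think this is beyond my regex skills.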
<cfparam name="Attributes.allowedclasses" default="">
<!--- turn allowed classes list to regular expression --->
<cfset Attributes.allowedclasses = Replace(Attributes.allowedclasses, ",", "|", "all")>
<cfset vBody="<body style='font-family:Verdana; font-size:12px;'>">
<cfset vStart="<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Transitional//EN' 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'><html xmlns='http://www.w3.org/1999/xhtml' lang='en' xml:lang='en'><head><title>Title</title></head>#vBody#">
<cfset vEnd="</body></html>">
<cfloop list="#Attributes.varnames#" index="theVariable">
<cfset vIntVar=evaluate("caller.#theVariable#")>
<cf_bocctrimformvars varnames="vIntVar" allowhtml="yes" quotes="unescape" allowPound="yes">
<cfset vIntVarDebug=vIntVar>
<!--- strip copy and paste word etc code formatting --->
<cfset vIntVar=ReReplaceNoCase(vIntVar, "</?[a-z0-9-=""'!\$\?%&\*\[email protected]~##;,\\]*:[a-z0-9 -=""'!\$\?%&\*\[email protected]~##;,\\]*>", "", "all")>
<!--- stop certain classes being stripped out --->
<cfif ListLen(Attributes.allowedclasses) NEQ 0>
<cfset vIntVar=ReReplaceNoCase(vIntVar, '<span class="(#Attributes.allowedclasses#)">([\s\S]*?)</span>', '<excludespan classexclude="\1">\2</excludespan>', 'all')>
<!--- stop other classes being stripped out --->
<cfset vIntVar=ReReplaceNoCase(vIntVar, '<([a-z0-9]+) class="(#Attributes.allowedclasses#)"[^>]*>', '<\1 classexclude="\2">', 'all')>
</cfif>
<!--- strip out span and font tags --->
<cfset vIntVar=ReReplaceNoCase(vIntVar, "</?(span|font)[^>]*>", "", "all")>
<!--- strip out rest of styles/classes --->
<cfset vIntVar=ReReplaceNoCase(vIntVar, "<([a-z0-9]+) (style|class)=[^>]*>", "<\1>", "all")>
<!--- reset classes which shouldn't be stripped out --->
<cfif ListLen(Attributes.allowedclasses) NEQ 0>
<cfset vIntVar=ReReplaceNoCase(vIntVar, '<excludespan classexclude="([a-z0-9-]+)"[^>]*>', '<span class="\1">', 'all')>
<cfset vIntVar=ReplaceNoCase(vIntVar, '</excludespan>', '</span>', 'all')>
<cfset vIntVar=ReReplaceNoCase(vIntVar, '<([a-z0-9]+) classexclude="([a-z0-9-]+)"[^>]*>', '<\1 class="\2">', 'all')>
</cfif>
<cfset vIntVar=ReReplaceNoCase(vIntVar, "<\?xml[^>]*>", "", "all")>
<cfset vIntVar=ReReplaceNoCase(vIntVar, "<p>([[:space:]])*</p>", "", "all")>
<cfset vIntVar=ReReplaceNoCase(vIntVar, "</?U>", "", "all")>
<cfset vIntVar=ReReplaceNoCase(vIntVar, "</?DIV[^>]*>", "", "all")>
<cfset vIntVar=ReReplaceNoCase(vIntVar, "</?PRE>", "", "all")>
<cfset vIntVar=ReplaceNoCase(vIntVar, 'target=""', '', 'all')>
<!---
DG 19/9/2004: fix put in to swap round <p> and <a> tags if a single <p> is inside an <a>
(which html tidy doesn't like)
--->
<cfset vIntVar=ReReplaceNoCase(vIntVar, "<a([[:print:]]*)>[[:space:]]*<p>([[:print:]]*)</p>([[:space:]]*)</a>", "<p><a\1>\2</a></p>", 'all')>
<cfset vIntVar=vStart & vIntVar & vEnd>
<cflock name="tidy" type="exclusive" timeout="10">
<cfscript>
TidyObj = CreateObject("COM", "TidyCOM.TidyObject");
TidyOptions = TidyObj.Options;
TidyOptions.Doctype = "omit";
TidyOptions.TidyMark = false;
TidyOptions.OutputXml = false;
TidyOptions.InputXml = false;
TidyOptions.OutputXhtml = true;
TidyOptions.ShowWarnings = false;
TidyOptions.DropEmptyParas = true;
TidyOptions.Quiet = true;
TidyOptions.Indent = 0;
TidyOptions.Wrap = 0;
TidyOptions.QuoteAmpersand = true;
vIntVar = TidyObj.TidyMemToMem(vIntVar);
TidyObj = "";
</cfscript>
</cflock>
<!--- strip any image tags inserted by drag and drop etc --->
<cfset vIntVar=ReReplaceNoCase(vIntVar, "<img [^>]*>", "", "all")>
</cfloop>
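I should warn you that parse-HTML-with-regex questions tend to be frowned upon a bit here - see this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – 2010-11-25 10:48:52
I can understand it being frowned upon. It's an old script. Right now I just need a quick fix, not a rewrite. – Sam 2010-11-25 11:00:26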
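One possible quick fix, offered as a sketch rather than a tested answer: the script already protects allowed classes by renaming them, stripping, and renaming them back, and the same idea can be applied to whole <object> blocks. It is not clear from the code which step drops the tags (it could be the cf_bocctrimformvars custom tag or Tidy), so the sketch below parks each block behind a plain-text token before any cleaning and restores it after the last step. The names vStash, vPattern, vMatch, vToken and the token format cfkeepobjNend are made up for illustration, and the sketch assumes none of the cleaning steps alter that token text.

```cfm
<!--- Hypothetical sketch. Part 1 would go right after vIntVar is read from the caller,
      before the custom tag and the regex passes. --->
<cfset vStash = ArrayNew(1)>
<cfset vPattern = "<object[^>]*>[\s\S]*?</object>">
<!--- With the fourth argument true, REFindNoCase returns pos/len arrays; pos[1] is 0 when nothing matches --->
<cfset vMatch = REFindNoCase(vPattern, vIntVar, 1, true)>
<cfloop condition="vMatch.pos[1] GT 0">
    <!--- Keep the original markup, then swap it for a numbered plain-text token --->
    <cfset ArrayAppend(vStash, Mid(vIntVar, vMatch.pos[1], vMatch.len[1]))>
    <cfset vToken = "cfkeepobj" & ArrayLen(vStash) & "end">
    <cfset vIntVar = RemoveChars(vIntVar, vMatch.pos[1], vMatch.len[1])>
    <cfset vIntVar = Insert(vToken, vIntVar, vMatch.pos[1] - 1)>
    <cfset vMatch = REFindNoCase(vPattern, vIntVar, 1, true)>
</cfloop>

<!--- ... the existing ReReplaceNoCase calls, the Tidy block and the img strip run here unchanged ... --->

<!--- Part 2 would go after the final img strip, still inside the cfloop over varnames --->
<cfloop from="1" to="#ArrayLen(vStash)#" index="i">
    <cfset vIntVar = Replace(vIntVar, "cfkeepobj#i#end", vStash[i])>
</cfloop>
```

The trade-off is that the restored markup bypasses every cleaning step, Tidy included, so anything inside those blocks goes through exactly as submitted. A second tag could be handled the same way by adding another pattern, provided it has a closing tag; a void tag would need a pattern without the closing part.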