Whitelisting target="_blank" with a regular expression

I am working on a website that sanitizes output from the database so that a limited set of HTML tags is allowed. It uses regular expressions to sanitize the data.

At present it allows a standard Google link (a plain href with no target), but it does not allow:

<a href="http://www.google.com" target="_blank" title="Google">Google</a>

The code currently looks like this:

private static Regex _tags = new Regex("<[^>]*(>|$)", 
RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled); 
private static Regex _whitelist = new Regex(@" 
^</?(b(lockquote)?|code|d(d|t|l|el)|em|h(1|2|3)|i|kbd|u|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)>$| 
^<(b|h)r\s?/?>$", 
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace); 
private static Regex _whitelist_a = new Regex(@" 
^<a\s 
href=""(\#\d+|(https?|ftp)://[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+)"" 
(\stitle=""[^""<>]+"")?\s?>$| 
^</a>$", 
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace); 
private static Regex _whitelist_img = new Regex(@" 
^<img\s 
src=""https?://[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+"" 
(\swidth=""\d{1,3}"")? 
(\sheight=""\d{1,3}"")? 
(\salt=""[^""<>]*"")? 
(\stitle=""[^""<>]*"")? 
\s?/?>$", 
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace); 


/// <summary> 
/// sanitize any potentially dangerous tags from the provided raw HTML input using 
/// a whitelist based approach, leaving the "safe" HTML tags 
/// CODESNIPPET:4100A61A-1711-4366-B0B0-144D1179A937 
/// </summary> 
public static string Sanitize(string html) 
{ 
    if (String.IsNullOrEmpty(html)) return html; 

    string tagname; 
    Match tag; 

    // match every HTML tag in the input 
    MatchCollection tags = _tags.Matches(html); 
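    // walk backwards so that removing a tag does not shift the
    // indexes of the tags that still have to be checked
    for (int i = tags.Count - 1; i > -1; i--)
    {
        tag = tags[i];
        tagname = tag.Value.ToLowerInvariant();

        // strip any tag that none of the three whitelists accepts
        if (!(_whitelist.IsMatch(tagname) || _whitelist_a.IsMatch(tagname) || _whitelist_img.IsMatch(tagname)))
        {
            html = html.Remove(tag.Index, tag.Length);
        }
    }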

    return html; 
}

I would like to allow hrefs that have a target as well.

Any help would be great, thanks.


2011-08-16 Andrew Cassidy

Regular expressions are not suited to this purpose. You need to use an HTML parser. – ThiefMaster

Edited to include the second request from the comments.

Change:

private static Regex _whitelist_a = new Regex(@" 
^<a\s 
href=""(\#\d+|(https?|ftp)://[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+)"" 
(\stitle=""[^""<>]+"")?\s?>$| 
^</a>$", 
RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);

to:

private static Regex _whitelist_a = new Regex(@" 
^<a(\starget=""[^""<>]+"")?\s 
href=""(\#\d+|(https?|ftp)://[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+)"" 
(\starget=""[^""<>]+"")?(\stitle=""[^""<>]+"")?\s?>$| 
^</a>$", 
RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);

This is not necessarily the perfect solution, but it will allow a "target" before the "href", after it, both before and after, or not at all.
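To see what the edited pattern accepts, here is a minimal sketch that wraps it in a small helper. The class name `AnchorWhitelistDemo` and the method `IsAllowedAnchor` are made up for illustration and are not part of the original code. The tag is lowercased before matching, as `Sanitize` does.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for trying out the edited pattern in isolation.
public static class AnchorWhitelistDemo
{
    // Same pattern as the edited _whitelist_a above: an optional target
    // attribute is accepted either directly before or directly after href.
    private static readonly Regex WhitelistA = new Regex(@"
        ^<a(\starget=""[^""<>]+"")?\s
        href=""(\#\d+|(https?|ftp)://[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+)""
        (\starget=""[^""<>]+"")?(\stitle=""[^""<>]+"")?\s?>$|
        ^</a>$",
        RegexOptions.Singleline | RegexOptions.ExplicitCapture |
        RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);

    // Lowercase the tag first, mirroring what Sanitize does before matching.
    public static bool IsAllowedAnchor(string tag)
    {
        return WhitelistA.IsMatch(tag.ToLowerInvariant());
    }
}
```

With this helper, `IsAllowedAnchor("<a href=\"http://www.google.com\" target=\"_blank\" title=\"Google\">")` and `IsAllowedAnchor("<a target=\"_blank\" href=\"http://www.google.com\">")` both return true. A tag that puts title before target, such as `<a href="http://www.google.com" title="Google" target="_blank">`, is still rejected, because the attribute order is fixed in the pattern.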

You should be able to write a more concise regular expression, along these lines:

^<a(\s+(?:target|href|title)="[^"<>]+")*\s*>$|^</a>$

I am not sure exactly how you would write this in your code, because I am not familiar with C# or .NET, but you could try the following:

private static Regex _whitelist_a = new Regex(
    @"^<a(\s+(?:target|href|title)=""[^""<>]+"")*\s*>$|^</a>$", 
    RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace 
);

The advantage of this solution over the one above is that it allows href, target and title in any order, with any amount of whitespace between them.
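One thing worth checking before adopting the concise pattern: it accepts any quoted value for href, so it no longer enforces the http/https/ftp restriction of the original whitelist, and a tag such as `<a href="javascript:alert(1)">` would pass. The sketch below is one possible way to keep the any-order behaviour while retaining the original href rule. The class `AnchorWhitelist`, the field names `Concise` and `Strict`, and the method `IsAllowed` are hypothetical names chosen for this example, and the `Strict` pattern is an assumption rather than something taken from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical container comparing the concise pattern with a stricter variant.
public static class AnchorWhitelist
{
    private const RegexOptions Options =
        RegexOptions.Singleline | RegexOptions.ExplicitCapture |
        RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace;

    // The concise pattern from the answer: href may hold any quoted value.
    public static readonly Regex Concise = new Regex(
        @"^<a(\s+(?:target|href|title)=""[^""<>]+"")*\s*>$|^</a>$", Options);

    // Assumed variant: attributes may still appear in any order, but href
    // must satisfy the same URL rule as the original _whitelist_a.
    public static readonly Regex Strict = new Regex(@"
        ^<a
        (
            \s+href=""(\#\d+|(https?|ftp)://[-a-z0-9+&@#/%?=~_|!:,.;\(\)]+)"" |
            \s+target=""[^""<>]+"" |
            \s+title=""[^""<>]+""
        )*
        \s*>$|
        ^</a>$", Options);

    // Lowercase the tag first, mirroring what Sanitize does before matching.
    public static bool IsAllowed(Regex whitelist, string tag)
    {
        return whitelist.IsMatch(tag.ToLowerInvariant());
    }
}
```

Under these assumptions, `IsAllowed(AnchorWhitelist.Strict, "<a title=\"Google\" target=\"_blank\"  href=\"http://www.google.com\">")` returns true, while the `javascript:` tag above is accepted by `Concise` and rejected by `Strict`. Both patterns still allow an attribute to be repeated, so this is a starting point rather than a complete rule set.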


2011-08-16 15:46:29

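Works perfectly, thank you very much for your help.

Hi, this is working, thanks, but it is also possible for a user to enter Google as well as Google (which now works). Is it possible to add a rule for this? Thanks.

I have updated my answer, which should hopefully solve the second problem.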
