Sanitizing HTML-encoded text (&#DECIMAL; notation) from AntiXSS v3 output

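I'm building XSS-safe comments for a blog engine. I've tried a lot of different approaches, but I'm finding it very difficult.

When I display a comment, I first use Microsoft AntiXss 3.0 to HTML-encode the whole thing. Then I try to HTML-decode the safe tags using a whitelist approach.

I've been looking at Steve's example in Atwood's "Sanitize HTML" thread on refactormycode.

My problem is that the AntiXss library encodes values to &#DECIMAL; notation, and I don't know how to rewrite Steve's example, since my regex knowledge is limited.

I tried the following code, where I simply replaced the named entities with their decimal form, but it doesn't work properly.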

&lt; with &#60; 
&gt; with &#62;

My rewrite:

class HtmlSanitizer 
{ 
    /// <summary> 
    /// A regex that matches things that look like a HTML tag after HtmlEncoding. Splits the input so we can get discrete 
    /// chunks that start with &lt; and ends with either end of line or &gt; 
    /// </summary> 
    private static Regex _tags = new Regex("&#60;(?!&#62;).+?(&#62;|$)", RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled); 


    /// <summary> 
    /// A regex that will match tags on the whitelist, so we can run them through 
    /// HttpUtility.HtmlDecode 
    /// FIXME - Could be improved, since this might decode &gt; etc in the middle of 
    /// an a/link tag (i.e. in the text in between the opening and closing tag) 
    /// </summary> 
    private static Regex _whitelist = new Regex(@" 
^&#60;/?(a|b(lockquote)?|code|em|h(1|2|3)|i|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)&#62;$ 
|^&#60;(b|h)r\s?/?&#62;$ 
|^&#60;a(?!&#62;).+?&#62;$ 
|^&#60;img(?!&#62;).+?/?&#62;$", 


     RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace | 
     RegexOptions.ExplicitCapture | RegexOptions.Compiled); 

    /// <summary> 
    /// HtmlDecode any potentially safe HTML tags from the provided HtmlEncoded HTML input using 
    /// a whitelist based approach, leaving the dangerous tags Encoded HTML tags 
    /// </summary> 
    public static string Sanitize(string html) 
    { 

     string tagname = ""; 
     Match tag; 
     MatchCollection tags = _tags.Matches(html); 
     string safeHtml = ""; 

     // iterate through all HTML tags in the input 
     for (int i = tags.Count - 1; i > -1; i--) 
     { 
      tag = tags[i]; 
      tagname = tag.Value.ToLowerInvariant(); 

      if (_whitelist.IsMatch(tagname)) 
      { 
       // If we find a tag on the whitelist, run it through 
       // HtmlDecode, and re-insert it into the text 
       safeHtml = HttpUtility.HtmlDecode(tag.Value); 
       html = html.Remove(tag.Index, tag.Length); 
       html = html.Insert(tag.Index, safeHtml); 
      } 

     } 

     return html; 
    } 

}

My test input HTML is:

<p><script language="javascript">alert('XSS')</script><b>bold should work</b></p>

After AntiXss it becomes:

&#60;p&#62;&#60;script language&#61;&#34;javascript&#34;&#62;alert&#40;&#39;XSS&#39;&#41;&#60;&#47;script&#62;&#60;b&#62;bold should work&#60;&#47;b&#62;&#60;&#47;p&#62;

When I run my version of Sanitize(string html) on it, I get:

<p><script language="javascript">alert&#40;&#39;XSS&#39;&#41;</script><b>bold should work</b></p>

The regex matches the script tag, which is not on the whitelist. Any help would be greatly appreciated.

Source

2008-12-28 jesperlind

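Just remembered this: http://www.codinghorror.com/blog/archives/001171.html – some 2008-12-28 16:01:50

I've been all over these links for the last 24 hours. I can't believe it has to be this complicated. The remark quoted in the comments on the CSRF article, about how scary web development is, rings very true. – jesperlind 2008-12-28 16:33:48

Beware of whitelisting the IMG tag. The onerror attribute can be used to inject script. – PEZ 2008-12-28 16:37:29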

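Your problem is that C# does not interpret your regex the way you intended. With RegexOptions.IgnorePatternWhitespace, an unescaped # starts a comment that runs to the end of the line, so you need to escape the # sign. Without the escape, the pattern matches too much.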

private static Regex _whitelist = new Regex(@" 
    ^&\#60;(&\#47;)? (a|b(lockquote)?|code|em|h(1|2|3)|i|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)&\#62;$ 
    |^&\#60;(b|h)r\s?(&\#47;)?&\#62;$ 
    |^&\#60;a(?!&\#62;).+?&\#62;$ 
    |^&\#60;img(?!&\#62;).+?(&\#47;)?&\#62;$", 

    RegexOptions.Singleline |
    RegexOptions.IgnorePatternWhitespace |
    RegexOptions.ExplicitCapture |
    RegexOptions.Compiled
);

Update 2: You might be interested in this XSS and regexp site.

Source

2008-12-28 17:13:06 some
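
The effect of the unescaped # described in the answer above can be reproduced in isolation. The following is a minimal sketch, not code from the thread: the class name HashCommentDemo and the two single-tag patterns are made up for illustration. With RegexOptions.IgnorePatternWhitespace, everything after an unescaped # is treated as a comment, so the first pattern degenerates to "^&" and matches any encoded tag.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, for illustration only.
public static class HashCommentDemo
{
    // Unescaped '#': under IgnorePatternWhitespace the rest of the line is a
    // comment, so the effective pattern is just "^&".
    public static readonly Regex Unescaped = new Regex(
        @"^&#60;b&#62;$",
        RegexOptions.IgnorePatternWhitespace);

    // Escaped '#': the encoded tag is matched literally.
    public static readonly Regex Escaped = new Regex(
        @"^&\#60;b&\#62;$",
        RegexOptions.IgnorePatternWhitespace);

    public static void Main()
    {
        string script = "&#60;script&#62;";
        string bold = "&#60;b&#62;";

        Console.WriteLine(Unescaped.IsMatch(script)); // True: pattern collapsed to "^&"
        Console.WriteLine(Escaped.IsMatch(script));   // False
        Console.WriteLine(Escaped.IsMatch(bold));     // True
    }
}
```

This is why the _tags regex in the question, which does not use IgnorePatternWhitespace, behaved as expected while the _whitelist regex let the script tag through.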

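Have you considered using Markdown or VBCode or some similar approach for users to mark up their comments? Then you could disallow all HTML.

If you must allow HTML, then I would consider using an HTML parser (in the spirit of HTMLTidy) and doing the whitelisting there.

Source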

2008-12-28 16:13:06 PEZ

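Yes, I'm using the WMD editor with Markdown, but I want users to be able to post HTML and code samples as on Stack Overflow, so I don't want to disallow HTML completely.

I've been looking at HTML Tidy, but haven't tried it yet. I do, however, use Html Agility Pack to make sure the HTML is well formed (no orphaned tags). This is done before I run AntiXss.

If I can't get my current solution to work the way I like, I'll give HTML Tidy a try. Thanks for the suggestion.

Source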

2008-12-28 16:30:17 jesperlind

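I'm on a Mac, so I can't test your C# code. But it seems to me that you should make the _whitelist regex apply only to the tag names. That might mean you have to make two passes, one for opening and one for closing tags. But it would make things a lot simpler.

Source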

2008-12-28 16:51:06 PEZ
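
The tag-name-only idea suggested above could look roughly like the sketch below. It is an illustration under several assumptions, not code from the thread: the class name TagNameWhitelist, the BareTag pattern and the Allowed set are hypothetical, the two passes are collapsed into one regex with an optional encoded slash, and only attribute-free tags are handled (so a tags with href are left encoded). It uses System.Net.WebUtility.HtmlDecode rather than HttpUtility so that it runs without a reference to System.Web.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical sketch: whitelist by tag name instead of by full-tag regex.
public static class TagNameWhitelist
{
    // Tag names taken from the whitelist regex in the question (without a and img).
    private static readonly HashSet<string> Allowed =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "b", "blockquote", "code", "em", "h1", "h2", "h3", "i", "li", "ol",
            "p", "pre", "s", "sub", "sup", "strong", "strike", "ul", "br", "hr"
        };

    // An encoded tag without attributes: &#60;, optional &#47;, a tag name,
    // optional whitespace and self-closing &#47;, then &#62;.
    private static readonly Regex BareTag = new Regex(
        @"&\#60;(?:&\#47;)?(?<name>[a-zA-Z][a-zA-Z0-9]*)\s?(?:&\#47;)?&\#62;",
        RegexOptions.Compiled);

    public static string Sanitize(string encoded)
    {
        // Decode a tag only when its name is on the whitelist; leave the rest encoded.
        return BareTag.Replace(encoded, m =>
            Allowed.Contains(m.Groups["name"].Value)
                ? WebUtility.HtmlDecode(m.Value)
                : m.Value);
    }

    public static void Main()
    {
        string input = "&#60;p&#62;&#60;script&#62;alert&#40;1&#41;&#60;&#47;script&#62;"
                     + "&#60;b&#62;bold&#60;&#47;b&#62;&#60;&#47;p&#62;";
        Console.WriteLine(Sanitize(input));
        // <p>&#60;script&#62;alert&#40;1&#41;&#60;&#47;script&#62;<b>bold</b></p>
    }
}
```

Because the whitelist is a plain set of names, adding or removing a tag does not require touching the regex. Like the rest of the code in this thread, the sketch has not been tested against real XSS payloads.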

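In case anyone is interested in using it, I'm posting the complete code here (slightly refactored and with updated comments).

I also decided to remove the img tag from the whitelist, since @some pointed out that it can be dangerous.

It must also be pointed out that I have not tested this properly against possible XSS attacks. Take it only as an illustration of how well the approach has worked for me.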

class HtmlSanitizer 
{ 
    /// <summary> 
    /// A regex that matches things that look like an HTML tag after HtmlEncoding to &#DECIMAL; notation.
    /// Microsoft AntiXSS 3.0 can be used to perform this. Splits the input so we can get discrete
    /// chunks that start with &#60; and end with either end of line or &#62;
    /// </summary> 
    private static readonly Regex _tags = new Regex(@"&\#60;(?!&\#62;).+?(&\#62;|$)", RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.Compiled); 


    /// <summary> 
    /// A regex that will match tags on the whitelist, so we can run them through 
    /// HttpUtility.HtmlDecode 
    /// FIXME - Could be improved, since this might decode &#60; etc in the middle of 
    /// an a/link tag (i.e. in the text in between the opening and closing tag) 
    /// </summary> 

    private static readonly Regex _whitelist = new Regex(@" 
^&\#60;(&\#47;)? (a|b(lockquote)?|code|em|h(1|2|3)|i|li|ol|p(re)?|s(ub|up|trong|trike)?|ul)&\#62;$ 
|^&\#60;(b|h)r\s?(&\#47;)?&\#62;$ 
|^&\#60;a(?!&\#62;).+?&\#62;$", 


     RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace | 
     RegexOptions.ExplicitCapture | RegexOptions.Compiled); 

    /// <summary> 
    /// HtmlDecode any potentially safe HTML tags from the provided HtmlEncoded HTML input using
    /// a whitelist based approach, leaving the dangerous tags HTML-encoded
    /// </summary> 
    public static string Sanitize(string html) 
    { 
     Match tag; 
     MatchCollection tags = _tags.Matches(html); 

     // iterate through all HTML tags in the input 
     for (int i = tags.Count - 1; i > -1; i--) 
     { 
      tag = tags[i]; 
      string tagname = tag.Value.ToLowerInvariant(); 

      if (_whitelist.IsMatch(tagname)) 
      { 
       // If we find a tag on the whitelist, run it through 
       // HtmlDecode, and re-insert it into the text 
       string safeHtml = HttpUtility.HtmlDecode(tag.Value); 
       html = html.Remove(tag.Index, tag.Length); 
       html = html.Insert(tag.Index, safeHtml); 
      } 
     } 
     return html; 
    } 
}

Source

2008-12-28 18:17:37 jesperlind
