Safely removing HTML from a message


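I need to output all of the plain text within messages that may include valid and/or invalid HTML, and possibly text that is only superficially similar to HTML (i.e. non-HTML text within <...>, such as: < why would someone do this?? >).

Preserving all non-HTML content matters more to me than stripping out all of the HTML, but ideally I would like to get rid of as much HTML as possible for readability.

I am currently using HTML Agility Pack, but I have a problem: non-HTML content between < and > is also removed. For example:

My function: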

text = HttpUtility.HtmlDecode(text); 
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument(); 
doc.LoadHtml(text); 
text = doc.DocumentNode.InnerText;

Simple example input*:

this text has <b>weird < things</b> going on >

Actual output (unacceptable, the word "things" is lost):

this text has weird going on >

Desired output:

this text has weird < things going on >

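Is there a way to remove only legitimate HTML tags with HTML Agility Pack, without stripping out other content that may include < and/or >? Or do I need to manually create a whitelist of tags to remove, as in this question? That is my fallback solution, but I was hoping for a more complete solution built into HTML Agility Pack (or another tool) that I just have not been able to find.

*(Real input often contains a ton of unneeded HTML; I can give a longer example if that would be useful.)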


2017-09-12 violaceous

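When dealing with broken HTML, you are going to have imperfections. It is not surprising that HTMLAgilityPack interprets "things" as part of the HTML content. When the HTML is invalid, the library has to guess using heuristics, and those heuristics are not perfect. Even if you write your own parser, as Kevin did in his answer, you are not going to fare any better. – Amy

I have found the regex '/<[^>]*>/' to be a good way to find and remove tags. So 'Regex.Replace(input, "<[^>]*>", "")' should be a good starting point. That said, it would be better to avoid parsing HTML altogether. –

I wrote this a long time ago to do something similar. You can use it as a starting point:

You will need: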

using System; 
using System.Collections.Generic;

And the code:

/// <summary> 
/// Instances of this class strip HTML/XML tags from a string 
/// </summary> 
public class HTMLStripper 
{ 
    public HTMLStripper() { } 
    public HTMLStripper(string source) 
    { 
     m_source = source; 
     stripTags(); 
    } 

    private const char m_beginToken = '<'; 
    private const char m_endToken = '>'; 
    private const char m_whiteSpace = ' '; 

    private enum tokenType 
    { 
     nonToken = 0, 
     beginToken = 1, 
     endToken = 2, 
     escapeToken = 3, 
     whiteSpace = 4 
    } 

    private string m_source = string.Empty; 
    private string m_stripped = string.Empty; 
    private string m_tagName = string.Empty; 
    private string m_tag = string.Empty; 
    private Int32 m_startpos = -1; 
    private Int32 m_endpos = -1; 
    private Int32 m_currentpos = -1; 
    private IList<string> m_skipTags = new List<string>(); 
    private bool m_tagFound = false; 
    private bool m_tagsStripped = false; 

    /// <summary> 
    /// Gets or sets the source string. 
    /// </summary> 
    /// <value> 
    /// The source string. 
    /// </value> 
    public string source { get { return m_source; } set { clear(); m_source = value; stripTags(); } } 

    /// <summary> 
    /// Gets the string stripped of HTML tags. 
    /// </summary> 
    /// <value> 
    /// The string. 
    /// </value> 
    public string stripped { get { return m_stripped; } set { } } 

    /// <summary> 
    /// Gets or sets a value indicating whether [HTML tags were stripped]. 
    /// </summary> 
    /// <value> 
    /// <c>true</c> if [HTML tags were stripped]; otherwise, <c>false</c>. 
    /// </value> 
    public bool tagsStripped { get { return m_tagsStripped; } set { } } 

    /// <summary> 
    /// Adds the name of an HTML tag to skip stripping (leave in the text). 
    /// </summary> 
    /// <param name="value">The value.</param> 
    public void addSkipTag(string value) 
    { 
     if (value.Length > 0) 
     { 
      // Trim start and end tokens from skipTags if present and add to list 
      CharEnumerator tmpScanner = value.GetEnumerator(); 
      string tmpString = string.Empty; 
      while (tmpScanner.MoveNext()) 
      { 
       if (tmpScanner.Current != m_beginToken && tmpScanner.Current != m_endToken) { tmpString += tmpScanner.Current; } 
      } 
      if (tmpString.Length > 0) { m_skipTags.Add(tmpString); } 
     } 
    } 

    /// <summary> 
    /// Clears this instance. 
    /// </summary> 
    public void clear() 
    { 
     m_source = string.Empty; 
     m_tag = string.Empty; 
     m_startpos = -1; 
     m_endpos = -1; 
     m_currentpos = -1; 
     m_tagsStripped = false; 
    } 

    /// <summary> 
    /// Clears all. 
    /// </summary> 
    public void clearAll() 
    { 
     this.clear(); 
     m_skipTags.Clear(); 
    } 

    /// <summary> 
    /// Strips the HTML tags. 
    /// </summary> 
    private void stripTags() 
    { 
     // Preserve source and make a copy for stripping 
     m_stripped = m_source; 
     // Find first tag 
     getNext(); 
     // If there are any tags (if next tag is string.Empty we are at EOS)... 
     if (m_tagName != string.Empty) 
     { 
      do 
      { 
       // If the tag we found is not to be skipped... 
       if (!m_skipTags.Contains(m_tagName)) 
       { 
        // Remove tag from string 
        m_stripped = m_stripped.Remove(m_startpos, m_endpos - m_startpos + 1); 
        m_tagsStripped = true; 
       } 
       // Get next tag, rinse and repeat (if next tag is string.Empty we are at EOS) 
       getNext(); 
      } while (m_tagName != string.Empty); 
     } 
    } 

    /// <summary> 
    /// Steps the pointer to the next HTML tag. 
    /// </summary> 
    private void getNext() 
    { 
     m_tagFound = false; 
     m_tag = string.Empty; 
     m_tagName = string.Empty; 
     bool beginTokenFound = false; 
     CharEnumerator scanner = m_stripped.GetEnumerator(); 
     // If we're not at the beginning of the string, move the enumerator to the appropriate location in the string 
     if (m_currentpos != -1) 
     { 
      Int32 index = 0; 
      do 
      { 
       scanner.MoveNext(); 
       index += 1; 
      } while (index < m_currentpos + 1); 
     } 
     while (!m_tagFound && m_currentpos + 1 < m_stripped.Length) 
     { 
      // Find next begin token 
      while (scanner.MoveNext()) 
      { 
       m_currentpos += 1; 
       if (evaluateChar(scanner.Current) == tokenType.beginToken) 
       { 
        m_startpos = m_currentpos; 
        beginTokenFound = true; 
        break; 
       } 
      } 
      // If a begin token is found, find next end token 
      if (beginTokenFound) 
      { 
       while (scanner.MoveNext()) 
       { 
        m_currentpos += 1; 
        // If we find another begin token before finding an end token we are not in a tag 
        if (evaluateChar(scanner.Current) == tokenType.beginToken) 
        { 
         m_tagFound = false; 
         beginTokenFound = true; 
         break; 
        } 
        // If the char immediately following a begin token is a white space we are not in a tag 
        if (m_currentpos - m_startpos == 1 && evaluateChar(scanner.Current) == tokenType.whiteSpace) 
        { 
         m_tagFound = false; 
         beginTokenFound = true; 
         break; 
        } 
        // End token found 
        if (evaluateChar(scanner.Current) == tokenType.endToken) 
        { 
         m_endpos = m_currentpos; 
         m_tagFound = true; 
         break; 
        } 
       } 
      } 
      if (m_tagFound) 
      { 
       // Found a tag, get the info for this tag 
       m_tag = m_stripped.Substring(m_startpos, (m_endpos + 1) - m_startpos); 
       m_tagName = m_stripped.Substring(m_startpos + 1, m_endpos - m_startpos - 1); 
       // If this tag is to be skipped, we do not want to reset the position within the string 
       // Also, if we are at the end of the string (EOS) we do not want to reset the position 
       if (!m_skipTags.Contains(m_tagName) && m_currentpos != stripped.Length) 
       { 
        m_currentpos = -1; 
       } 
      } 
     } 
    } 

    /// <summary> 
    /// Evaluates the next character. 
    /// </summary> 
    /// <param name="value">The value.</param> 
    /// <returns>tokenType</returns> 
    private tokenType evaluateChar(char value) 
    { 
     tokenType returnValue = new tokenType(); 
     switch (value) 
     { 
      case m_beginToken: 
       returnValue = tokenType.beginToken; 
       break; 
      case m_endToken: 
       returnValue = tokenType.endToken; 
       break; 
      case m_whiteSpace: 
       returnValue = tokenType.whiteSpace; 
       break; 
      default: 
       returnValue = tokenType.nonToken; 
       break; 
     } 
     return returnValue; 
    } 
}
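
The class above is fairly long, so here is a much shorter sketch of the same basic heuristic, for readers who only want the core idea. It is not Kevin's code and does not reproduce its skip-tag list: it assumes that a `<` followed by whitespace, or by another `<` before any `>`, is ordinary text and not a tag. The names `TagStripSketch` and `Strip` are made up for this illustration.

```csharp
using System;
using System.Text;

public static class TagStripSketch
{
    // Removes "<...>" spans in a single pass, but only when the '<' is not
    // followed by whitespace and no other '<' appears before the closing '>'.
    // Anything that fails that test is copied to the output unchanged.
    public static string Strip(string input)
    {
        var output = new StringBuilder(input.Length);
        int i = 0;
        while (i < input.Length)
        {
            if (input[i] == '<' && i + 1 < input.Length && !char.IsWhiteSpace(input[i + 1]))
            {
                // Scan forward to the first '>' or '<' after the opening '<'.
                int j = i + 1;
                while (j < input.Length && input[j] != '>' && input[j] != '<')
                {
                    j++;
                }
                // Treat it as a tag only if we stopped on '>' and the span is not just "<>".
                if (j < input.Length && input[j] == '>' && j > i + 1)
                {
                    i = j + 1; // skip the whole tag-like span
                    continue;
                }
            }
            output.Append(input[i]);
            i++;
        }
        return output.ToString();
    }
}
```

Calling `TagStripSketch.Strip("this text has <b>weird < things</b> going on >")` should give `this text has weird < things going on >`, which is the output the question asks for. Like any heuristic of this kind, it will still remove non-HTML text that happens to look like a tag, for example `<oops>`.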


2017-09-12 20:33:58 Kevin

You can use this pattern to replace HTML tags:

</?[a-zA-Z][a-zA-Z0-9 \"=_-]*?>

Explanation:

< 
maybe / (as it may be a closing tag) 
    match a-z or A-Z as the first letter 
     MAYBE match any of a-z, A-Z, 0-9, space, "=_- any number of times 
      >

Final code:

using System; 
using System.Text.RegularExpressions; 
namespace Regular 
{ 
    class Program 
    { 
     static void Main(string[] args) 
     { 
      string yourText = "this text has <b>weird < things</b> going on >"; 
      string newText = Regex.Replace(yourText, "</?[a-zA-Z][a-zA-Z0-9 \"=_-]*>", ""); 
      Console.WriteLine(newText); 
     } 
    } 
}

Output:

this text has weird < things going on >

@corey-ogburn's comment is not correct, because < [space] abc> would be replaced.
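
To make the difference concrete, here is a small sketch that runs both patterns over the same input. The class and member names (`PatternComparison`, `Broad`, `Narrow`, `StripWith`) are made up for this illustration; the two patterns are the ones quoted in this thread, and `Regex.Replace` is the standard .NET call.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternComparison
{
    // Pattern from the comment: anything between '<' and the next '>'.
    public const string Broad = "<[^>]*>";

    // Pattern from this answer: '<', optional '/', a letter, then a
    // restricted set of characters up to '>'.
    public const string Narrow = "</?[a-zA-Z][a-zA-Z0-9 \"=_-]*>";

    // Removes every match of the given pattern from the text.
    public static string StripWith(string pattern, string text)
    {
        return Regex.Replace(text, pattern, "");
    }
}
```

With the input `this text has <b>weird < things</b> going on >`, `StripWith(PatternComparison.Broad, ...)` swallows `< things</b>` as if it were one tag and loses the word "things", while `StripWith(PatternComparison.Narrow, ...)` leaves `< things` alone because a space is not a letter. The narrow pattern has limits of its own: its character class does not include `/`, `:`, or `.`, so tags such as `<br/>` or `<a href="http://example.com">` are left in the text.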

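When you just want to strip the tags out of the string, I do not see a reason why you would want to check whether you have an opening or a closing tag, but you could easily do that with a regex as well.

Using regex to parse HTML is not always a good choice, but I think it is fine if you want to parse simple text.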


2017-09-13 11:31:24 Droppy
