2017-09-12
-1

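Safely removing HTML from messages

I need to output all of the plain text from messages that may include valid and/or invalid HTML, and possibly text that only superficially resembles HTML (i.e. non-HTML text inside <...>, such as: < why would someone do this?? >).

Preserving all non-HTML content is more important than removing all HTML, but ideally I would like to get rid of as much HTML as possible to improve readability.

I'm currently using HTML Agility Pack, but I have a problem where non-HTML text inside < > also gets removed, for example:

My function: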

text = HttpUtility.HtmlDecode(text); 
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument(); 
doc.LoadHtml(text); 
text = doc.DocumentNode.InnerText; 

Simple example input*:

this text has <b>weird < things</b> going on > 

Actual output (unacceptable, the word "things" is lost):

this text has weird going on > 

Desired output:

this text has weird < things going on > 

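Is there a way to make HTML Agility Pack remove only legitimate HTML tags, without stripping out other content that may include < and/or >? Or do I need to manually create a whitelist of tags to remove, as in this question? That is my fallback solution, but I'm hoping there is a more complete solution built into HTML Agility Pack (or another tool) that I just haven't been able to find.

*(Real input often has a ton of HTML in it that isn't needed; I can give a longer example if that would be useful.)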
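The whitelist fallback mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `KnownTagStripper` and the short tag list are hypothetical, nothing here is part of HTML Agility Pack, and a real list would have to cover whatever tags actually show up in the messages.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class KnownTagStripper
{
    // Hypothetical list of tag names to remove; extend as needed.
    private static readonly string[] KnownTags = { "a", "b", "br", "div", "i", "p", "span" };

    // Matches "<name ...>" or "</name ...>" only when name is in KnownTags.
    // The \b stops "<b" from matching the start of text like "<because ...>".
    private static readonly Regex TagPattern = new Regex(
        "</?(?:" + string.Join("|", KnownTags.Select(Regex.Escape)) + @")\b[^<>]*>",
        RegexOptions.IgnoreCase);

    // Returns the text with every known tag removed and everything else untouched.
    public static string Strip(string text)
    {
        return TagPattern.Replace(text, string.Empty);
    }
}
```

Calling `KnownTagStripper.Strip` on the sample input removes `<b>` and `</b>` but leaves `< things` alone, because a space is not a known tag name. The sketch would still misfire on an attribute value that itself contains `>`.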

+0

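When dealing with broken HTML, you are going to get imperfect results. It is not surprising that HTMLAgilityPack interprets "things" as part of the HTML content. When the HTML is invalid, the library has to use heuristics to guess, and those heuristics aren't perfect. Even if you write your own parser, as Kevin did in his answer, you won't do much better. – Amy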

+0

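I've found the regex '/<[^>]+>/' to be a good way to find and remove tags. So 'Regex.Replace(input, "<[^>]+>", "")' should be a good starting point. That said, avoiding parsing HTML altogether would be better. –

Answers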

0

I wrote this a long time ago to do something similar. You can use it as a starting point:

You'll need:

using System; 
using System.Collections.Generic; 

And the code:

/// <summary> 
/// Instances of this class strip HTML/XML tags from a string 
/// </summary> 
public class HTMLStripper 
{ 
    public HTMLStripper() { } 
    public HTMLStripper(string source) 
    { 
     m_source = source; 
     stripTags(); 
    } 

    private const char m_beginToken = '<'; 
    private const char m_endToken = '>'; 
    private const char m_whiteSpace = ' '; 

    private enum tokenType 
    { 
     nonToken = 0, 
     beginToken = 1, 
     endToken = 2, 
     escapeToken = 3, 
     whiteSpace = 4 
    } 

    private string m_source = string.Empty; 
    private string m_stripped = string.Empty; 
    private string m_tagName = string.Empty; 
    private string m_tag = string.Empty; 
    private Int32 m_startpos = -1; 
    private Int32 m_endpos = -1; 
    private Int32 m_currentpos = -1; 
    private IList<string> m_skipTags = new List<string>(); 
    private bool m_tagFound = false; 
    private bool m_tagsStripped = false; 

    /// <summary> 
    /// Gets or sets the source string. 
    /// </summary> 
    /// <value> 
    /// The source string. 
    /// </value> 
    public string source { get { return m_source; } set { clear(); m_source = value; stripTags(); } } 

    /// <summary> 
    /// Gets the string stripped of HTML tags. 
    /// </summary> 
    /// <value> 
    /// The string. 
    /// </value> 
    public string stripped { get { return m_stripped; } set { } } 

    /// <summary> 
    /// Gets or sets a value indicating whether [HTML tags were stripped]. 
    /// </summary> 
    /// <value> 
    /// <c>true</c> if [HTML tags were stripped]; otherwise, <c>false</c>. 
    /// </value> 
    public bool tagsStripped { get { return m_tagsStripped; } set { } } 

    /// <summary> 
    /// Adds the name of an HTML tag to skip stripping (leave in the text). 
    /// </summary> 
    /// <param name="value">The value.</param> 
    public void addSkipTag(string value) 
    { 
     if (value.Length > 0) 
     { 
      // Trim start and end tokens from skipTags if present and add to list 
      CharEnumerator tmpScanner = value.GetEnumerator(); 
      string tmpString = string.Empty; 
      while (tmpScanner.MoveNext()) 
      { 
       if (tmpScanner.Current != m_beginToken && tmpScanner.Current != m_endToken) { tmpString += tmpScanner.Current; } 
      } 
      if (tmpString.Length > 0) { m_skipTags.Add(tmpString); } 
     } 
    } 

    /// <summary> 
    /// Clears this instance. 
    /// </summary> 
    public void clear() 
    { 
     m_source = string.Empty; 
     m_tag = string.Empty; 
     m_startpos = -1; 
     m_endpos = -1; 
     m_currentpos = -1; 
     m_tagsStripped = false; 
    } 

    /// <summary> 
    /// Clears all. 
    /// </summary> 
    public void clearAll() 
    { 
     this.clear(); 
     m_skipTags.Clear(); 
    } 

    /// <summary> 
    /// Strips the HTML tags. 
    /// </summary> 
    private void stripTags() 
    { 
     // Preserve source and make a copy for stripping 
     m_stripped = m_source; 
     // Find first tag 
     getNext(); 
     // If there are any tags (if next tag is string.Empty we are at EOS)... 
     if (m_tagName != string.Empty) 
     { 
      do 
      { 
       // If the tag we found is not to be skipped... 
       if (!m_skipTags.Contains(m_tagName)) 
       { 
        // Remove tag from string 
        m_stripped = m_stripped.Remove(m_startpos, m_endpos - m_startpos + 1); 
        m_tagsStripped = true; 
       } 
       // Get next tag, rinse and repeat (if next tag is string.Empty we are at EOS) 
       getNext(); 
      } while (m_tagName != string.Empty); 
     } 
    } 

    /// <summary> 
    /// Steps the pointer to the next HTML tag. 
    /// </summary> 
    private void getNext() 
    { 
     m_tagFound = false; 
     m_tag = string.Empty; 
     m_tagName = string.Empty; 
     bool beginTokenFound = false; 
     CharEnumerator scanner = m_stripped.GetEnumerator(); 
     // If we're not at the beginning of the string, move the enumerator to the appropriate location in the string 
     if (m_currentpos != -1) 
     { 
      Int32 index = 0; 
      do 
      { 
       scanner.MoveNext(); 
       index += 1; 
      } while (index < m_currentpos + 1); 
     } 
     while (!m_tagFound && m_currentpos + 1 < m_stripped.Length) 
     { 
      // Find next begin token 
      while (scanner.MoveNext()) 
      { 
       m_currentpos += 1; 
       if (evaluateChar(scanner.Current) == tokenType.beginToken) 
       { 
        m_startpos = m_currentpos; 
        beginTokenFound = true; 
        break; 
       } 
      } 
      // If a begin token is found, find next end token 
      if (beginTokenFound) 
      { 
       while (scanner.MoveNext()) 
       { 
        m_currentpos += 1; 
        // If we find another begin token before finding an end token we are not in a tag 
        if (evaluateChar(scanner.Current) == tokenType.beginToken) 
        { 
         m_tagFound = false; 
         beginTokenFound = true; 
         break; 
        } 
        // If the char immediately following a begin token is a white space we are not in a tag 
        if (m_currentpos - m_startpos == 1 && evaluateChar(scanner.Current) == tokenType.whiteSpace) 
        { 
         m_tagFound = false; 
         beginTokenFound = true; 
         break; 
        } 
        // End token found 
        if (evaluateChar(scanner.Current) == tokenType.endToken) 
        { 
         m_endpos = m_currentpos; 
         m_tagFound = true; 
         break; 
        } 
       } 
      } 
      if (m_tagFound) 
      { 
       // Found a tag, get the info for this tag 
       m_tag = m_stripped.Substring(m_startpos, (m_endpos + 1) - m_startpos); 
       m_tagName = m_stripped.Substring(m_startpos + 1, m_endpos - m_startpos - 1); 
       // If this tag is to be skipped, we do not want to reset the position within the string 
       // Also, if we are at the end of the string (EOS) we do not want to reset the position 
       if (!m_skipTags.Contains(m_tagName) && m_currentpos != stripped.Length) 
       { 
        m_currentpos = -1; 
       } 
      } 
     } 
    } 

    /// <summary> 
    /// Evaluates the next character. 
    /// </summary> 
    /// <param name="value">The value.</param> 
    /// <returns>tokenType</returns> 
    private tokenType evaluateChar(char value) 
    { 
     tokenType returnValue = new tokenType(); 
     switch (value) 
     { 
      case m_beginToken: 
       returnValue = tokenType.beginToken; 
       break; 
      case m_endToken: 
       returnValue = tokenType.endToken; 
       break; 
      case m_whiteSpace: 
       returnValue = tokenType.whiteSpace; 
       break; 
      default: 
       returnValue = tokenType.nonToken; 
       break; 
     } 
     return returnValue; 
    } 
} 
0

You can use this pattern to replace HTML tags:

</?[a-zA-Z][a-zA-Z0-9 \"=_-]*?> 

Explanation:

< 
maybe / (as it may be a closing tag) 
    match a-z or A-Z as the first letter 
     then zero or more of a-z, A-Z, 0-9, space, " = _ - 
      > 

Final code:

using System; 
using System.Text.RegularExpressions; 
namespace Regular 
{ 
    class Program 
    { 
     static void Main(string[] args) 
     { 
      string yourText = "this text has <b>weird < things</b> going on >"; 
      string newText = Regex.Replace(yourText, "</?[a-zA-Z][a-zA-Z0-9 \"=_-]*>", ""); 
      Console.WriteLine(newText); 
     } 
    } 
} 

Output:

this text has weird < things going on >


@corey-ogburn's comment is not correct, because < [space] abc> would be replaced.
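To make the difference concrete, here is a small sketch that puts the two patterns side by side. It assumes the pattern from the comment is `<[^>]+>`; the class and method names are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class PatternComparison
{
    // Pattern assumed from the comment: anything between '<' and the next '>'.
    public static string StripAnyBrackets(string text)
    {
        return Regex.Replace(text, "<[^>]+>", "");
    }

    // Pattern from this answer: '<', an optional '/', then a letter,
    // so "< abc>" (space right after '<') is never treated as a tag.
    public static string StripTagLike(string text)
    {
        return Regex.Replace(text, "</?[a-zA-Z][a-zA-Z0-9 \"=_-]*>", "");
    }
}
```

On the question's sample input, `StripAnyBrackets` swallows `< things</b>` as if it were one tag, which loses the word "things", while `StripTagLike` removes only `<b>` and `</b>`.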


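Since you just want to strip them out of the string, I don't see a reason why you would want to check whether a tag has a start/end, but you could easily do that with a regular expression too.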


Parsing HTML with regular expressions is not always a good choice, but I think it is fine if you want to parse simple text.
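As a rough illustration of what "simple" means here, the sketch below checks which individual tags the pattern would remove. The helper names `SimpleTagPattern` and `IsRemoved` are hypothetical and exist only for this example.

```csharp
using System.Text.RegularExpressions;

public static class SimpleTagPattern
{
    // Same pattern as in the answer above.
    private const string Pattern = "</?[a-zA-Z][a-zA-Z0-9 \"=_-]*>";

    public static string Strip(string text)
    {
        return Regex.Replace(text, Pattern, "");
    }

    // For a non-empty input: true when the whole input is a single tag
    // that the pattern removes.
    public static bool IsRemoved(string tag)
    {
        return Strip(tag).Length == 0;
    }
}
```

`IsRemoved("<p class=\"x\">")` is true, but `IsRemoved("<br/>")` and `IsRemoved("<a href=\"http://example.com\">")` are both false, because `/`, `:` and `.` are not in the character class. Tags like these would stay in the output unless the class is widened.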