Remove HTML tags from a string, including &nbsp;, in C#

How do I remove all HTML tags, including &nbsp;, using a regular expression in C#? My string looks like this:

"<div>hello</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div>"

2013-10-22 rampuriyaaa

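Don't use a regular expression; check out the HTML Agility Pack. http://stackoverflow.com/questions/846994/how-to-use-html-agility-pack – Tim

Thanks Tim, but the application is quite large and already complete, so adding or downloading the HTML Agility Pack is not an option. – rampuriyaaa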

172

If you can't use an HTML parser based solution to filter out the tags, here is a simple regex for it.

string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

Ideally, you should make a second pass with a regex that collapses runs of whitespace:

string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");

2013-10-22 17:08:21

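I haven't tested this as much as I will need to, but it worked better than I expected. I'll post the method I wrote below. –

A lazy match (<[^>]+?>, per @David S.) might make this slightly faster, but I just used this solution in a live project and am very happy with it :) +1 –

Regex.Replace(inputHTML, @"<[^>]+>|&nbsp|\n;", "").Trim(); does not remove \n –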
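A note on the comment above: in the pattern as posted, the semicolon sits after \n instead of after &nbsp, so the last alternative only matches a newline that is immediately followed by a semicolon. The following is a minimal sketch of a corrected pattern, assuming the intent was to drop tags, &nbsp; and line breaks. The class and method names are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class NewlineFix
{
    // Alternatives: any tag, the &nbsp; entity, or a CR/LF character.
    private const string Pattern = @"<[^>]+>|&nbsp;|[\r\n]";

    public static string Strip(string inputHTML)
    {
        // Remove every match, then trim leading/trailing whitespace.
        return Regex.Replace(inputHTML, Pattern, "").Trim();
    }
}
```

For example, NewlineFix.Strip("<div>a\nb&nbsp;</div>\n") yields "ab".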

This:

(<.+?>|&nbsp;)

will match any tag or &nbsp;

string regex = @"(<.+?>|&nbsp;)"; 
var x = Regex.Replace(originalString, regex, "").Trim();

Then x = hello

2013-10-22 17:08:10 Jonesopolis

I've been using this function for a while. It removes pretty much any messy HTML you can throw at it and leaves the text intact.

     // Requires: using System; using System.Text; using System.Text.RegularExpressions; using System.Web;
     private static readonly Regex _tags_ = new Regex(@"<[^>]+?>", RegexOptions.Multiline | RegexOptions.Compiled); 

     //add characters that should not be removed to this regex 
     private static readonly Regex _notOkCharacter_ = new Regex(@"[^\w;&#@.:/\\?=|%!() -]", RegexOptions.Compiled); 

     public static String UnHtml(String html) 
     { 
      html = HttpUtility.UrlDecode(html); 
      html = HttpUtility.HtmlDecode(html); 

      html = RemoveTag(html, "<!--", "-->"); 
      html = RemoveTag(html, "<script", "</script>"); 
      html = RemoveTag(html, "<style", "</style>"); 

      //replace matches of these regexes with space 
      html = _tags_.Replace(html, " "); 
      html = _notOkCharacter_.Replace(html, " "); 
      html = SingleSpacedTrim(html); 

      return html; 
     } 

     private static String RemoveTag(String html, String startTag, String endTag) 
     { 
      Boolean bAgain; 
      do 
      { 
       bAgain = false; 
       Int32 startTagPos = html.IndexOf(startTag, 0, StringComparison.CurrentCultureIgnoreCase); 
       if (startTagPos < 0) 
        continue; 
       Int32 endTagPos = html.IndexOf(endTag, startTagPos + 1, StringComparison.CurrentCultureIgnoreCase); 
       if (endTagPos <= startTagPos) 
        continue; 
       html = html.Remove(startTagPos, endTagPos - startTagPos + endTag.Length); 
       bAgain = true; 
      } while (bAgain); 
      return html; 
     } 

     private static String SingleSpacedTrim(String inString) 
     { 
      StringBuilder sb = new StringBuilder(); 
      Boolean inBlanks = false; 
      foreach (Char c in inString) 
      { 
       switch (c) 
       { 
        case '\r': 
        case '\n': 
        case '\t': 
        case ' ': 
         if (!inBlanks) 
         { 
          inBlanks = true; 
          sb.Append(' '); 
         } 
         continue; 
        default: 
         inBlanks = false; 
         sb.Append(c); 
         break; 
       } 
      } 
      return sb.ToString().Trim(); 
     }

2013-10-22 17:14:30

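Just to confirm: does the SingleSpacedTrim() function do the same thing as string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " "); from Ravi Thapliyal's answer? – Jimmy

@Jimmy As far as I can see, that regex doesn't catch single tabs or newlines the way SingleSpacedTrim() does. That may be the effect you want, though; in that case, just remove those cases as needed. –

Nice, but it seems to replace single and double quotes with spaces as well, even though they are not in the "_notOkCharacter_" list. Or am I missing something? Is this caused by the decoding/encoding methods called at the beginning? What would be needed to keep these characters intact? – vm370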
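The difference discussed in these two comments can be seen in a small sketch. \s{2,} only fires on runs of two or more whitespace characters, so a lone tab or newline survives, while \s+ replaces every run with one space. The class and method names below are hypothetical. Note that \s+ is only roughly equivalent to SingleSpacedTrim(), because \s also matches whitespace characters that the switch statement does not list.

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // Collapses only runs of two or more whitespace characters.
    public static string CollapseRuns(string s)
    {
        return Regex.Replace(s, @"\s{2,}", " ");
    }

    // Replaces every whitespace run, including a single tab or newline,
    // with one space, then trims the ends.
    public static string CollapseAll(string s)
    {
        return Regex.Replace(s, @"\s+", " ").Trim();
    }
}
```

With the input "a\tb", CollapseRuns leaves the tab in place, whereas CollapseAll returns "a b".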
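On the question about quotes: _notOkCharacter_ is a negated character class, so it lists the characters that are kept, and everything else is replaced with a space. Neither ' nor " appears in that list, which is why they disappear. A minimal sketch of one possible adjustment, assuming you simply want both quote characters kept (the class name KeepQuotes is made up for illustration):

```csharp
using System.Text.RegularExpressions;

public static class KeepQuotes
{
    // Same negated class as in the answer, with ' and " added to the allowed set.
    // Inside a verbatim string literal, "" stands for one double-quote character.
    private static readonly Regex NotOkCharacter =
        new Regex(@"[^\w;&#@.:/\\?=|%!()'"" -]", RegexOptions.Compiled);

    public static string Clean(string text)
    {
        // Any character outside the allowed set becomes a space.
        return NotOkCharacter.Replace(text, " ");
    }
}
```

The hyphen stays at the end of the class so that it is read as a literal character rather than a range.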

var noHtml = Regex.Replace(inputHTML, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();
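Listing entities one by one means every new entity needs a pattern change. As an alternative sketch, not part of the answer above, the tags can be stripped first and the remaining entities decoded with System.Net.WebUtility.HtmlDecode. This assumes a framework version where WebUtility is available; the class and method names are hypothetical. Unlike the one-liner above, tags are replaced with a space so that text from adjacent elements does not run together, and entities such as &raquo; are decoded to their characters rather than deleted.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class EntityAwareStrip
{
    public static string ToText(string inputHTML)
    {
        // Strip tags first, so a decoded &lt; or &gt; is never mistaken for markup.
        string noTags = Regex.Replace(inputHTML, @"<[^>]*(>|$)", " ");

        // Decode named and numeric entities; &nbsp; becomes U+00A0.
        string decoded = WebUtility.HtmlDecode(noTags);

        // In .NET, \s also matches U+00A0, so non-breaking spaces collapse too.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

For the input "<div>a&nbsp;&amp;&nbsp;b</div><div><br></div>" this returns "a & b".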

2014-06-11 06:27:50 MRP

I took @Ravi Thapliyal's code and turned it into a method. It is simple and might not clean everything, but so far it does what I need it to do.

public static string ScrubHtml(string value) { 
    var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim(); 
    var step2 = Regex.Replace(step1, @"\s{2,}", " "); 
    return step2; 
}

2014-07-31 14:50:46

-1

Sanitizing an HTML document involves a lot of tricky details. This package may help: https://github.com/mganss/HtmlSanitizer

2016-01-04 19:54:16 ehsan88

-1

(<([^>]+)>|&nbsp;)

You can test it here: https://regex101.com/r/kB0rQ4/1

2017-02-10 17:58:20
