如何在不支付組件的情況下將HTML轉換爲RTF（Rich Text）？

34

其實有一個簡單而免費解決方案：使用你的瀏覽器，確定這是我用過的伎倆：

var webBrowser = new WebBrowser(); 
webBrowser.CreateControl(); // only if needed 
webBrowser.DocumentText = *yourhtmlstring*; 
while (_webBrowser.DocumentText != *yourhtmlstring*) 
    Application.DoEvents(); 
webBrowser.Document.ExecCommand("SelectAll", false, null); 
webBrowser.Document.ExecCommand("Copy", false, null); 
*yourRichTextControl*.Paste();

這可能是比其它方法速度慢，但至少它是免費的，作品！

來源

2011-01-31 18:34:04 Spartaco

1

也許你需要的是a control to edit the HTML？

來源

2008-09-30 08:10:05 GvS

3

這當然不完美，但這裏是我用來將HTML轉換爲純文本的代碼。

public static string ConvertHtmlToText(string source) { 

      string result; 

      // Remove HTML Development formatting 
      // Replace line breaks with space 
      // because browsers inserts space 
      result = source.Replace("\r", " "); 
      // Replace line breaks with space 
      // because browsers inserts space 
      result = result.Replace("\n", " "); 
      // Remove step-formatting 
      result = result.Replace("\t", string.Empty); 
      // Remove repeating speces becuase browsers ignore them 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
                    @"()+", " "); 

      // Remove the header (prepare first by clearing attributes) 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"<()*head([^>])*>", "<head>", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"(<()*(/)()*head()*>)", "</head>", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        "(<head>).*(</head>)", string.Empty, 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

      // remove all scripts (prepare first by clearing attributes) 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"<()*script([^>])*>", "<script>", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"(<()*(/)()*script()*>)", "</script>", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      //result = System.Text.RegularExpressions.Regex.Replace(result, 
      //   @"(<script>)([^(<script>\.</script>)])*(</script>)", 
      //   string.Empty, 
      //   System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"(<script>).*(</script>)", string.Empty, 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

      // remove all styles (prepare first by clearing attributes) 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"<()*style([^>])*>", "<style>", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"(<()*(/)()*style()*>)", "</style>", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        "(<style>).*(</style>)", string.Empty, 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

      // insert tabs in spaces of <td> tags 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"<()*td([^>])*>", "\t", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

      // insert line breaks in places of <BR> and <LI> tags 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"<()*br()*>", "\r", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"<()*li()*>", "\r", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

      // insert line paragraphs (double line breaks) in place 
      // if <P>, <DIV> and <TR> tags 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"<()*div([^>])*>", "\r\r", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"<()*tr([^>])*>", "\r\r", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"<()*p([^>])*>", "\r\r", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

      // Remove remaining tags like <a>, links, images, 
      // comments etc - anything thats enclosed inside < > 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"<[^>]*>", string.Empty, 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

      // replace special characters: 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"&nbsp;", " ", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"&bull;", " * ", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"&lsaquo;", "<", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"&rsaquo;", ">", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"&trade;", "(tm)", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"&frasl;", "/", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"<", "<", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @">", ">", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"&copy;", "(c)", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"&reg;", "(r)", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      // Remove all others. More can be added, see 
      // http://hotwired.lycos.com/webmonkey/reference/special_characters/ 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"&(.{2,6});", string.Empty, 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 


      // make line breaking consistent 
      result = result.Replace("\n", "\r"); 

      // Remove extra line breaks and tabs: 
      // replace over 2 breaks with 2 and over 4 tabs with 4. 
      // Prepare first to remove any whitespaces inbetween 
      // the escaped characters and remove redundant tabs inbetween linebreaks 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        "(\r)()+(\r)", "\r\r", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        "(\t)()+(\t)", "\t\t", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        "(\t)()+(\r)", "\t\r", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        "(\r)()+(\t)", "\r\t", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      // Remove redundant tabs 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        "(\r)(\t)+(\r)", "\r\r", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      // Remove multible tabs followind a linebreak with just one tab 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        "(\r)(\t)+", "\r\t", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      // Initial replacement target string for linebreaks 
      string breaks = "\r\r\r"; 
      // Initial replacement target string for tabs 
      string tabs = "\t\t\t\t\t"; 
      for (int index = 0; index < result.Length; index++) { 
       result = result.Replace(breaks, "\r\r"); 
       result = result.Replace(tabs, "\t\t\t\t"); 
       breaks = breaks + "\r"; 
       tabs = tabs + "\t"; 
      } 

      // Thats it. 
      return result; 

    }

來源

2008-09-30 21:11:35 Andrew

+8

Downvoted的原因雄辯地解釋這裏：http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – 2011-06-30 07:26:14

+0

諷刺這與XSLT可能會出現錯誤的原因幾乎相同。 HTML很混亂。很少有適合轉換的XML文檔。我懷疑一個合適的解決方案會包含一點正則表達式，以使得文檔足夠乾淨以進行適當的XSLT轉換。 – Menefee 2012-06-23 03:47:07

8

檢查出XHTML2RTF這個CodeProject上的文章（我是不是原作者，我從網上找到的代碼將它改編）。

來源

2009-04-16 03:03:40

+0

非常適合XHTML，但是正如您從閱讀該名稱時猜測的那樣，對於非XHTML /「香草HTML」不適用... – sager89 2014-08-29 15:08:56

+0

太棒了！製作一個控制檯應用程序。需要在控制檯主要方法之前添加[STAThread]。 – dforce 2016-11-03 15:02:37

4

在Spartaco的答案擴大我給以下偉大的工作提供了以下內容！

Using reportWebBrowser As New WebBrowser 
     reportWebBrowser.CreateControl() 
     reportWebBrowser.DocumentText = sbHTMLDoc.ToString 
     While reportWebBrowser.DocumentText <> sbHTMLDoc.ToString 
      Application.DoEvents() 
     End While 
     reportWebBrowser.Document.ExecCommand("SelectAll", False, Nothing) 
     reportWebBrowser.Document.ExecCommand("Copy", False, Nothing) 

     Using reportRichTextBox As New RichTextBox 
      reportRichTextBox.Paste() 
      reportRichTextBox.SaveFile(DocumentFileName) 
     End Using 
    End Using

來源

2011-02-17 21:01:21 cjbarth

如何在不支付組件的情況下將HTML轉換爲RTF（Rich Text）？

回答

相關問題