2008-09-29 42 views
32

是否有免費的第三方或.NET類將HTML轉換爲RTF(用於啓用富文本的Windows窗體控件)?如何在不支付組件的情況下將HTML轉換爲RTF(Rich Text)?

「免費」要求來自於我只在一個原型上工作,只需加載BrowserControl並在需要時呈現HTML(即使速度很慢),並且Developer Express將會是儘快釋放他們自己的這種控制。

我不想學習編寫RTF,而且我已經知道HTML,所以我認爲這是快速獲得一些可證明代碼的最快捷方式。

回答

34

其實有一個簡單而免費解決方案:使用你的瀏覽器,確定這是我用過的伎倆:

var webBrowser = new WebBrowser(); 
webBrowser.CreateControl(); // only if needed 
webBrowser.DocumentText = *yourhtmlstring*; 
while (_webBrowser.DocumentText != *yourhtmlstring*) 
    Application.DoEvents(); 
webBrowser.Document.ExecCommand("SelectAll", false, null); 
webBrowser.Document.ExecCommand("Copy", false, null); 
*yourRichTextControl*.Paste(); 

這可能是比其它方法速度慢,但至少它是免費的,作品!

3

這當然不完美,但這裏是我用來將HTML轉換爲純文本的代碼。

public static string ConvertHtmlToText(string source) { 

      string result; 

      // Remove HTML Development formatting 
      // Replace line breaks with space 
      // because browsers inserts space 
      result = source.Replace("\r", " "); 
      // Replace line breaks with space 
      // because browsers inserts space 
      result = result.Replace("\n", " "); 
      // Remove step-formatting 
      result = result.Replace("\t", string.Empty); 
      // Remove repeating speces becuase browsers ignore them 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
                    @"()+", " "); 

      // Remove the header (prepare first by clearing attributes) 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"<()*head([^>])*>", "<head>", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"(<()*(/)()*head()*>)", "</head>", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        "(<head>).*(</head>)", string.Empty, 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

      // remove all scripts (prepare first by clearing attributes) 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"<()*script([^>])*>", "<script>", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"(<()*(/)()*script()*>)", "</script>", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      //result = System.Text.RegularExpressions.Regex.Replace(result, 
      //   @"(<script>)([^(<script>\.</script>)])*(</script>)", 
      //   string.Empty, 
      //   System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"(<script>).*(</script>)", string.Empty, 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

      // remove all styles (prepare first by clearing attributes) 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"<()*style([^>])*>", "<style>", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"(<()*(/)()*style()*>)", "</style>", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        "(<style>).*(</style>)", string.Empty, 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

      // insert tabs in spaces of <td> tags 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"<()*td([^>])*>", "\t", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

      // insert line breaks in places of <BR> and <LI> tags 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"<()*br()*>", "\r", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"<()*li()*>", "\r", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

      // insert line paragraphs (double line breaks) in place 
      // if <P>, <DIV> and <TR> tags 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"<()*div([^>])*>", "\r\r", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"<()*tr([^>])*>", "\r\r", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"<()*p([^>])*>", "\r\r", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

      // Remove remaining tags like <a>, links, images, 
      // comments etc - anything thats enclosed inside < > 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"<[^>]*>", string.Empty, 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

      // replace special characters: 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"&nbsp;", " ", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"&bull;", " * ", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"&lsaquo;", "<", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"&rsaquo;", ">", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"&trade;", "(tm)", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"&frasl;", "/", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"<", "<", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @">", ">", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"&copy;", "(c)", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"&reg;", "(r)", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      // Remove all others. More can be added, see 
      // http://hotwired.lycos.com/webmonkey/reference/special_characters/ 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        @"&(.{2,6});", string.Empty, 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 


      // make line breaking consistent 
      result = result.Replace("\n", "\r"); 

      // Remove extra line breaks and tabs: 
      // replace over 2 breaks with 2 and over 4 tabs with 4. 
      // Prepare first to remove any whitespaces inbetween 
      // the escaped characters and remove redundant tabs inbetween linebreaks 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        "(\r)()+(\r)", "\r\r", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        "(\t)()+(\t)", "\t\t", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        "(\t)()+(\r)", "\t\r", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        "(\r)()+(\t)", "\r\t", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      // Remove redundant tabs 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        "(\r)(\t)+(\r)", "\r\r", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      // Remove multible tabs followind a linebreak with just one tab 
      result = System.Text.RegularExpressions.Regex.Replace(result, 
        "(\r)(\t)+", "\r\t", 
        System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
      // Initial replacement target string for linebreaks 
      string breaks = "\r\r\r"; 
      // Initial replacement target string for tabs 
      string tabs = "\t\t\t\t\t"; 
      for (int index = 0; index < result.Length; index++) { 
       result = result.Replace(breaks, "\r\r"); 
       result = result.Replace(tabs, "\t\t\t\t"); 
       breaks = breaks + "\r"; 
       tabs = tabs + "\t"; 
      } 

      // Thats it. 
      return result; 

    } 
+8

Downvoted的原因雄辯地解釋這裏:http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – 2011-06-30 07:26:14

+0

諷刺這與XSLT可能會出現錯誤的原因幾乎相同。 HTML很混亂。很少有適合轉換的XML文檔。我懷疑一個合適的解決方案會包含一點正則表達式,以使得文檔足夠乾淨以進行適當的XSLT轉換。 – Menefee 2012-06-23 03:47:07

8

檢查出XHTML2RTF這個CodeProject上的文章(我是不是原作者,我從網上找到的代碼將它改編)。

+0

非常適合XHTML,但是正如您從閱讀該名稱時猜測的那樣,對於非XHTML /「香草HTML」不適用... – sager89 2014-08-29 15:08:56

+0

太棒了!製作一個控制檯應用程序。需要在控制檯主要方法之前添加[STAThread]。 – dforce 2016-11-03 15:02:37

4

在Spartaco的答案擴大我給以下偉大的工作提供了以下內容!

Using reportWebBrowser As New WebBrowser 
     reportWebBrowser.CreateControl() 
     reportWebBrowser.DocumentText = sbHTMLDoc.ToString 
     While reportWebBrowser.DocumentText <> sbHTMLDoc.ToString 
      Application.DoEvents() 
     End While 
     reportWebBrowser.Document.ExecCommand("SelectAll", False, Nothing) 
     reportWebBrowser.Document.ExecCommand("Copy", False, Nothing) 

     Using reportRichTextBox As New RichTextBox 
      reportRichTextBox.Paste() 
      reportRichTextBox.SaveFile(DocumentFileName) 
     End Using 
    End Using 
相關問題