2009-12-18 23 views
8

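HTML to plain text (for email)

Do you know of any good HTML to plain text conversion class written in PHP?

I need it to convert an HTML mail body into a plain text mail body.

I have written a simple function, but I need more features, such as converting tables, appending links at the end, converting nested lists...

Regards,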
takeshin

+0

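Why not just send HTML mail? I understand that faking tables in plain text is possible, but every email reader in the world reads HTML, so why not save yourself the hassle of a pointless conversion just because you or others refuse to use HTML mail? – TravisO 2009-12-18 19:58:45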

+4

TravisO: Not every reader does. Some don't automatically convert HTML to plain text, and for a user, raw HTML is usually not very nice to read :-) – Joey 2009-12-18 20:07:19

+0

1996 is over; it's fine to use it. But of course the elitists who hate HTML email will be the most vocal and the most willing to vote for those ideals. – TravisO 2009-12-18 21:00:11

Answers

6
+2

What is the benefit of Markdown in a text email? – 2009-12-18 22:49:02

+1

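Uh, have you used or read anything about Markdown? "The overriding design goal for Markdown's formatting syntax is to make it as readable as possible. **The idea is that a Markdown-formatted document should be publishable as-is, as plain text, without looking like it's been marked up with tags or formatting instructions.**" – ceejayoz 2009-12-18 23:55:08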

+2

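Markdownify is indeed a nice solution. I had looked at it before, but I thought it did not convert tables. The problem was that I had tried it on a table with '' attributes and some CSS styles. I removed the caption and the class and style attributes by hand, and it worked fine. – takeshin 2009-12-20 13:15:21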

3

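One ad-hoc mail-sending implementation here simply spawns lynx with the HTML and uses its output as the text version. It's not perfect, but it works. You could also use links or elinks.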

+0

Neat idea, I like it. – ceejayoz 2009-12-18 20:11:20

+0

Yes, this has already been suggested on StackOverflow, but I am asking for a PHP solution. I don't have access to lynx on my server. Thanks. – takeshin 2009-12-20 13:12:47

+0

You forgot to mention that you need to pass the '-dump' arg to lynx – JoelFan 2011-04-13 03:19:37

1

I know this question is about PHP, but I used the lynx idea to write this Perl sub that converts HTML to text:

use File::Temp; 

sub html2Txt { 
    my $html = shift; 
    my $htmlF = File::Temp->new(SUFFIX => '.html'); 
    print $htmlF $html; 
    close $htmlF; 
    return scalar `/usr/bin/lynx -dump $htmlF 2> /dev/null`; 
} 

print html2Txt '<b>Hi there</b> Testing'; 

Prints: Hi there Testing

2

You can do this using lynx with the -stdin and -dump options:

<?php 
$descriptorspec = array(
    0 => array("pipe", "r"), // stdin is a pipe that the child will read from 
    1 => array("pipe", "w"), // stdout is a pipe that the child will write to 
    2 => array("file", "/tmp/htmp2txt.log", "a") // stderr is a file to write to 
); 

// No "2>&1" here: stderr must stay on descriptor 2 so that it reaches the log file above
$process = proc_open('lynx -stdin -dump', $descriptorspec, $pipes, '/tmp', NULL);

if (is_resource($process)) { 
    // $pipes now looks like this: 
    // 0 => writeable handle connected to child stdin 
    // 1 => readable handle connected to child stdout 
    // Any error output will be appended to htmp2txt.log 

    $stdin = $pipes[0]; 
    fwrite($stdin, <<<'EOT' 
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> 
<head> 
<title>TEST</title> 
</head> 
<body> 
<h1><span>Lorem Ipsum</span></h1> 

<h4>"Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit..."</h4> 
<h5>"There is no one who loves pain itself, who seeks after it and wants to have it, simply because it is pain..."</h5> 
<p> 
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque et sapien ut erat porttitor suscipit id nec dui. Nam rhoncus mauris ac dui tristique bibendum. Aliquam molestie placerat gravida. Duis vitae tortor gravida libero semper cursus eu ut tortor. Nunc id orci orci. Suspendisse potenti. Phasellus vehicula leo sed erat rutrum sed blandit purus convallis. 
</p> 
<p> 
Aliquam feugiat, neque a tempus rhoncus, neque dolor vulputate eros, non pellentesque elit lacus ut nunc. Pellentesque vel purus libero, ultrices condimentum lorem. Nam dictum faucibus mollis. Praesent adipiscing nunc sed dui ultricies molestie. Quisque facilisis purus quis felis molestie ut accumsan felis ultricies. Curabitur euismod est id est pretium accumsan. Praesent a mi in dolor feugiat vehicula quis at elit. Mauris lacus mauris, laoreet non molestie nec, adipiscing a nulla. Nullam rutrum, libero id pellentesque tempus, erat nibh ornare dolor, id accumsan est risus at leo. In convallis felis at eros condimentum adipiscing aliquam nisi faucibus. Integer arcu ligula, porttitor in fermentum vitae, lacinia nec dui. 
</p> 
</body> 
</html> 
EOT 
    ); 
    fclose($stdin); 

    echo stream_get_contents($pipes[1]); 
    fclose($pipes[1]); 

    // It is important that you close any pipes before calling 
    // proc_close in order to avoid a deadlock 
    $return_value = proc_close($process); 

    echo "command returned $return_value\n"; 
} 
3

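Only use the lynx option if you have permission to run executables on the server. Even then, doing so is not considered good practice. Also, on secured hosts the PHP process is restricted from spawning a bash session, which is needed to run lynx.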
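Whether spawning is allowed at all can be checked from PHP before relying on lynx. A rough sketch, assuming the host restricts process functions through the `disable_functions` ini setting (other restrictions, such as a missing lynx binary or OS-level sandboxing, are not detected); `canSpawnProcesses()` is a hypothetical helper name, not an existing function:

```php
<?php
// Hypothetical helper: reports whether proc_open() looks usable on this host.
// It only inspects the disable_functions ini setting; a missing lynx binary
// or OS-level sandboxing would still make the actual call fail.
function canSpawnProcesses(): bool
{
    $disabled = array_map(
        'trim',
        explode(',', (string) ini_get('disable_functions'))
    );

    return function_exists('proc_open')
        && !in_array('proc_open', $disabled, true);
}
```

If this returns false, a pure-PHP converter is the only option on that host.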

The most complete solution written entirely in PHP that I could find is the Horde_Text_Filter_Html2text class. It is part of the Horde framework.

Other solutions I tried include:

If anyone has found a perfect solution, please post back for future reference!
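For what it's worth, a pure-PHP converter does not have to be built on regular expressions: the bundled DOM extension can parse the HTML, and a small recursive walk can emit the text. Below is a minimal sketch, not a finished library, covering the features asked for in the question (links listed at the end, nested lists, very crude tables). `htmlToText()` and `renderNode()` are made-up names, the `[n]` reference style and `* ` bullets are arbitrary choices, and it assumes UTF-8 input and an enabled `dom` extension.

```php
<?php
// Minimal sketch, not a complete converter. htmlToText() and renderNode()
// are hypothetical names; requires the bundled "dom" extension.

function htmlToText(string $html): string
{
    if (trim($html) === '') {
        return '';
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parse warnings quiet
    // The XML prolog is a common trick to make libxml read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($doc->documentElement === null) {
        return '';
    }

    $links = [];
    $text = renderNode($doc->documentElement, 0, $links);

    // Tidy up: drop trailing blanks, allow at most one empty line in a row.
    $text = preg_replace('/[ \t]+\n/', "\n", $text);
    $text = trim(preg_replace('/\n{3,}/', "\n\n", $text));

    // Append the collected link targets as numbered references.
    if ($links) {
        $text .= "\n\nLinks:";
        foreach ($links as $i => $href) {
            $text .= "\n[" . ($i + 1) . '] ' . $href;
        }
    }

    return $text;
}

function renderNode(DOMNode $node, int $depth, array &$links): string
{
    if ($node instanceof DOMText) {
        // Collapse whitespace runs the way a browser would.
        return preg_replace('/\s+/', ' ', $node->nodeValue) ?? '';
    }
    if (!$node instanceof DOMElement) {
        return ''; // comments, processing instructions, ...
    }

    $tag = strtolower($node->nodeName);
    if (in_array($tag, ['head', 'script', 'style'], true)) {
        return '';
    }

    // Items of a nested list are indented one level deeper.
    $isList = ($tag === 'ul' || $tag === 'ol');
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= renderNode($child, $isList ? $depth + 1 : $depth, $links);
    }

    switch ($tag) {
        case 'a':
            $href = $node->getAttribute('href');
            if ($href === '') {
                return $inner;
            }
            $links[] = $href; // remembered for the "Links:" section
            return $inner . ' [' . count($links) . ']';
        case 'br':
            return "\n";
        case 'li':
            return "\n" . str_repeat('  ', max(0, $depth - 1)) . '* ' . trim($inner);
        case 'ul':
        case 'ol':
            return $inner . "\n";
        case 'td':
        case 'th':
            return trim($inner) . "\t"; // crude: cells separated by tabs
        case 'tr':
            return rtrim($inner) . "\n";
        case 'p': case 'div': case 'table':
        case 'h1': case 'h2': case 'h3': case 'h4': case 'h5': case 'h6':
            return "\n\n" . trim($inner) . "\n\n";
        default:
            return $inner;
    }
}
```

Calling `htmlToText()` on `<p>See <a href="http://example.com/">the site</a>.</p><ul><li>One<ul><li>Nested</li></ul></li><li>Two</li></ul>` gives the paragraph text with `[1]` after the link text, a list with `* Nested` indented under `* One`, and a trailing `Links:` section containing `[1] http://example.com/`. Whitespace between tags in hand-formatted HTML can still leave stray spaces, so treat it as a starting point only.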

1

In C#:

private string StripHTML(string source) 
{ 
    try 
    { 
     string result; 

     // Remove HTML Development formatting 
     // Replace carriage returns with a space,
     // because browsers render a line break as a space
     result = source.Replace("\r", " ");
     // Do the same for line feeds
     result = result.Replace("\n", " ");
     // Remove step-formatting 
     result = result.Replace("\t", string.Empty); 
     // Remove repeating spaces because browsers ignore them 
     result = System.Text.RegularExpressions.Regex.Replace(result,
                   @"( )+", " ");

     // Remove the header (prepare first by clearing attributes) 
     result = System.Text.RegularExpressions.Regex.Replace(result,
       @"<( )*head([^>])*>", "<head>",
       System.Text.RegularExpressions.RegexOptions.IgnoreCase);
     result = System.Text.RegularExpressions.Regex.Replace(result,
       @"(<( )*(/)( )*head( )*>)", "</head>",
       System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
     result = System.Text.RegularExpressions.Regex.Replace(result, 
       "(<head>).*(</head>)", string.Empty, 
       System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

     // remove all scripts (prepare first by clearing attributes) 
     result = System.Text.RegularExpressions.Regex.Replace(result,
       @"<( )*script([^>])*>", "<script>",
       System.Text.RegularExpressions.RegexOptions.IgnoreCase);
     result = System.Text.RegularExpressions.Regex.Replace(result,
       @"(<( )*(/)( )*script( )*>)", "</script>",
       System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
     //result = System.Text.RegularExpressions.Regex.Replace(result, 
     //   @"(<script>)([^(<script>\.</script>)])*(</script>)", 
     //   string.Empty, 
     //   System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
     result = System.Text.RegularExpressions.Regex.Replace(result, 
       @"(<script>).*(</script>)", string.Empty, 
       System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

     // remove all styles (prepare first by clearing attributes) 
     result = System.Text.RegularExpressions.Regex.Replace(result,
       @"<( )*style([^>])*>", "<style>",
       System.Text.RegularExpressions.RegexOptions.IgnoreCase);
     result = System.Text.RegularExpressions.Regex.Replace(result,
       @"(<( )*(/)( )*style( )*>)", "</style>",
       System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
     result = System.Text.RegularExpressions.Regex.Replace(result, 
       "(<style>).*(</style>)", string.Empty, 
       System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

     // insert tabs in spaces of <td> tags 
     result = System.Text.RegularExpressions.Regex.Replace(result,
       @"<( )*td([^>])*>", "\t",
       System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

     // insert line breaks in places of <BR> and <LI> tags 
     result = System.Text.RegularExpressions.Regex.Replace(result,
       @"<( )*br( )*>", "\r",
       System.Text.RegularExpressions.RegexOptions.IgnoreCase);
     result = System.Text.RegularExpressions.Regex.Replace(result,
       @"<( )*li( )*>", "\r",
       System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

     // insert paragraph breaks (double line breaks) in place
     // of <P>, <DIV> and <TR> tags
     result = System.Text.RegularExpressions.Regex.Replace(result,
       @"<( )*div([^>])*>", "\r\r",
       System.Text.RegularExpressions.RegexOptions.IgnoreCase);
     result = System.Text.RegularExpressions.Regex.Replace(result,
       @"<( )*tr([^>])*>", "\r\r",
       System.Text.RegularExpressions.RegexOptions.IgnoreCase);
     result = System.Text.RegularExpressions.Regex.Replace(result,
       @"<( )*p([^>])*>", "\r\r",
       System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

     // Remove remaining tags like <a>, links, images, 
     // comments etc - anything that's enclosed inside < > 
     result = System.Text.RegularExpressions.Regex.Replace(result, 
       @"<[^>]*>", string.Empty, 
       System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

     // replace special characters: 
     result = System.Text.RegularExpressions.Regex.Replace(result,
       @"&nbsp;", " ",
       System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

     result = System.Text.RegularExpressions.Regex.Replace(result, 
       @"&bull;", " * ", 
       System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
     result = System.Text.RegularExpressions.Regex.Replace(result, 
       @"&lsaquo;", "<", 
       System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
     result = System.Text.RegularExpressions.Regex.Replace(result, 
       @"&rsaquo;", ">", 
       System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
     result = System.Text.RegularExpressions.Regex.Replace(result, 
       @"&trade;", "(tm)", 
       System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
     result = System.Text.RegularExpressions.Regex.Replace(result, 
       @"&frasl;", "/", 
       System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
     result = System.Text.RegularExpressions.Regex.Replace(result, 
       @"&lt;", "<", 
       System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
     result = System.Text.RegularExpressions.Regex.Replace(result, 
       @"&gt;", ">", 
       System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
     result = System.Text.RegularExpressions.Regex.Replace(result, 
       @"&copy;", "(c)", 
       System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
     result = System.Text.RegularExpressions.Regex.Replace(result, 
       @"&reg;", "(r)", 
       System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
     // Remove all others. More can be added, see 
     // http://hotwired.lycos.com/webmonkey/reference/special_characters/ 
     result = System.Text.RegularExpressions.Regex.Replace(result, 
       @"&(.{2,6});", string.Empty, 
       System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

     // for testing 
     //System.Text.RegularExpressions.Regex.Replace(result, 
     //  this.txtRegex.Text,string.Empty, 
     //  System.Text.RegularExpressions.RegexOptions.IgnoreCase); 

     // make line breaking consistent 
     result = result.Replace("\n", "\r"); 

     // Remove extra line breaks and tabs: 
     // replace over 2 breaks with 2 and over 4 tabs with 4. 
     // Prepare first to remove any whitespaces in between 
     // the escaped characters and remove redundant tabs in between line breaks 
     result = System.Text.RegularExpressions.Regex.Replace(result,
       "(\r)( )+(\r)", "\r\r",
       System.Text.RegularExpressions.RegexOptions.IgnoreCase);
     result = System.Text.RegularExpressions.Regex.Replace(result,
       "(\t)( )+(\t)", "\t\t",
       System.Text.RegularExpressions.RegexOptions.IgnoreCase);
     result = System.Text.RegularExpressions.Regex.Replace(result,
       "(\t)( )+(\r)", "\t\r",
       System.Text.RegularExpressions.RegexOptions.IgnoreCase);
     result = System.Text.RegularExpressions.Regex.Replace(result,
       "(\r)( )+(\t)", "\r\t",
       System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
     // Remove redundant tabs 
     result = System.Text.RegularExpressions.Regex.Replace(result, 
       "(\r)(\t)+(\r)", "\r\r", 
       System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
     // Remove multiple tabs following a line break with just one tab 
     result = System.Text.RegularExpressions.Regex.Replace(result, 
       "(\r)(\t)+", "\r\t", 
       System.Text.RegularExpressions.RegexOptions.IgnoreCase); 
     // Initial replacement target string for line breaks 
     string breaks = "\r\r\r"; 
     // Initial replacement target string for tabs 
     string tabs = "\t\t\t\t\t"; 
     for (int index = 0; index < result.Length; index++) 
     { 
      result = result.Replace(breaks, "\r\r"); 
      result = result.Replace(tabs, "\t\t\t\t"); 
      breaks = breaks + "\r"; 
      tabs = tabs + "\t"; 
     } 

     // That's it. 
     return result; 
    } 
    catch 
    { 
     MessageBox.Show("Error"); 
     return source; 
    } 
} 