2013-12-08 37 views
1

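Decoding 8bit email messages: Content-Transfer-Encoding: 8bit

I am developing an email viewer that reads .eml files and displays the message in a browser control. I found a code snippet that can display 7bit messages as well as quoted-printable and base64 ones (Content-Transfer-Encoding: quoted-printable / Content-Transfer-Encoding: base64). What I need is to decode 8bit messages.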

private static AlternateView ImportText(StringReader r, string encoding, System.Net.Mime.ContentType contentType)
{
    string line = string.Empty;
    StringBuilder b = new StringBuilder();
    while ((line = r.ReadLine()) != null)
    {
        switch (encoding)
        {
            case "quoted-printable":
                if (line.EndsWith("="))
                {
                    b.Append(DecodeQuotedPrintables(line.TrimEnd('='), contentType.CharSet));
                }
                else
                {
                    b.Append(DecodeQuotedPrintables(line, contentType.CharSet) + "\n");
                }
                break;
            case "base64":
                b.Append(DecodeBase64(line, contentType.CharSet));
                break;

            case "8bit": // I need an 8bit decoder here!!!
                b.Append(IneedAn8bitDecoderHere(line, contentType.CharSet));
                break;
            default:
                b.Append(line);
                break;
        }
    }

    AlternateView returnValue = AlternateView.CreateAlternateViewFromString(b.ToString(), null, contentType.MediaType);
    returnValue.TransferEncoding = TransferEncoding.QuotedPrintable;
    return returnValue;
}

I googled for an 8bit decoder but could not find one. Do I really need an 8bit decoder? Do you know of one that works well?

UPDATE:

Relevant headers (as seen in the `line` string in my code):

MIME-Version: 1.0 
Content-Type: text/plain; charset="koi8-r"; 
Content-Transfer-Encoding: 8bit 

Message body:

����������� �� ����, � ����� ��� � ������  ��������� ������� � �������� �������� �� ������� 

What Outlook actually displays:

Фантастично но факт, я снова как и раньше сделалась статной и красивой примерно за месяцок 

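I think I don't need the `case "8bit":` part after all. As SLaks mentioned, I need to load the mail source into a byte array at the start of the process instead of into a string. Reading the `charset=` value from the mail headers in that byte array then gives the appropriate code page.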

+1

Once you have read it into a string, it is too late. You need to use the appropriate `Encoding` to decode it from bytes. – SLaks

+0

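From [Exchange Server 2003: Content-Transfer-Encoding: 8bit](http://msdn.microsoft.com/zh-cn/library/ms526992(v=exchg.10).aspx): 8bit encoding has the same line-length limitations as 7bit encoding. It allows 8bit characters. No encoding or decoding is required for 8bit files. Since not all MTAs can handle 8bit data, 8bit encoding is not a valid encoding mechanism for Internet mail. –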

+0

That is about binary attachments, not about strings at all. – SLaks

Answers

2

This is how I solved the problem:

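// My previous method (File.ReadAllText assumes UTF-8 unless it finds a BOM,
// so the KOI8-R bytes were already lost at this point):
// string file = File.ReadAllText("koi8-r.eml");

// Correct method: read the raw bytes and detect the encoding before decoding.
string file;
Encoding efile = detectTextEncoding("koi8-r.eml", out file);

txtRaw.Text = file; // the decoded text comes back through the 'out' parameter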

Link: detectEncoding()

// Function to detect the encoding for UTF-7, UTF-8/16/32 (bom, no bom, little 
// & big endian), and local default codepage, and potentially other codepages. 
// 'taster' = number of bytes to check of the file (to save processing). Higher 
// value is slower, but more reliable (especially UTF-8 with special characters 
// later on may appear to be ASCII initially). If taster = 0, then taster 
// becomes the length of the file (for maximum reliability). 'text' is simply 
// the string with the discovered encoding applied to the file. 
public Encoding detectTextEncoding(string filename, out String text, int taster = 1000) 
{ 
byte[] b = File.ReadAllBytes(filename); 

//////////////// First check the low hanging fruit by checking if a 
//////////////// BOM/signature exists (sourced from http://www.unicode.org/faq/utf_bom.html#bom4) 
if (b.Length >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) { text = Encoding.GetEncoding("utf-32BE").GetString(b, 4, b.Length - 4); return Encoding.GetEncoding("utf-32BE"); } // UTF-32, big-endian 
else if (b.Length >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) { text = Encoding.UTF32.GetString(b, 4, b.Length - 4); return Encoding.UTF32; } // UTF-32, little-endian 
else if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF) { text = Encoding.BigEndianUnicode.GetString(b, 2, b.Length - 2); return Encoding.BigEndianUnicode; }  // UTF-16, big-endian 
else if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE) { text = Encoding.Unicode.GetString(b, 2, b.Length - 2); return Encoding.Unicode; }    // UTF-16, little-endian 
else if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) { text = Encoding.UTF8.GetString(b, 3, b.Length - 3); return Encoding.UTF8; } // UTF-8 
else if (b.Length >= 3 && b[0] == 0x2b && b[1] == 0x2f && b[2] == 0x76) { text = Encoding.UTF7.GetString(b,3,b.Length-3); return Encoding.UTF7; } // UTF-7 


//////////// If the code reaches here, no BOM/signature was found, so now 
//////////// we need to 'taste' the file to see if can manually discover 
//////////// the encoding. A high taster value is desired for UTF-8 
if (taster == 0 || taster > b.Length) taster = b.Length; // Taster size can't be bigger than the filesize obviously. 


// Some text files are encoded in UTF8, but have no BOM/signature. Hence 
// the below manually checks for a UTF8 pattern. This code is based off 
// the top answer at: https://stackoverflow.com/questions/6555015/check-for-invalid-utf8 
// For our purposes, an unnecessarily strict (and terser/slower) 
// implementation is shown at: https://stackoverflow.com/questions/1031645/how-to-detect-utf-8-in-plain-c 
// For the below, false positives should be exceedingly rare (and would 
// be either slightly malformed UTF-8 (which would suit our purposes 
// anyway) or 8-bit extended ASCII/UTF-16/32 at a vanishingly long shot). 
int i = 0; 
bool utf8 = false; 
while (i < taster - 4) 
{ 
    if (b[i] <= 0x7F) { i += 1; continue; }  // If all characters are below 0x80, then it is valid UTF8, but UTF8 is not 'required' (and therefore the text is more desirable to be treated as the default codepage of the computer). Hence, there's no "utf8 = true;" code unlike the next three checks. 
    if (b[i] >= 0xC2 && b[i] <= 0xDF && b[i + 1] >= 0x80 && b[i + 1] < 0xC0) { i += 2; utf8 = true; continue; } 
    if (b[i] >= 0xE0 && b[i] <= 0xEF && b[i + 1] >= 0x80 && b[i + 1] < 0xC0 && b[i + 2] >= 0x80 && b[i + 2] < 0xC0) { i += 3; utf8 = true; continue; } // 3-byte lead bytes are 0xE0-0xEF; 0xF0 starts a 4-byte sequence
    if (b[i] >= 0xF0 && b[i] <= 0xF4 && b[i + 1] >= 0x80 && b[i + 1] < 0xC0 && b[i + 2] >= 0x80 && b[i + 2] < 0xC0 && b[i + 3] >= 0x80 && b[i + 3] < 0xC0) { i += 4; utf8 = true; continue; } 
    utf8 = false; break; 
} 
if (utf8 == true) { 
    text = Encoding.UTF8.GetString(b); 
    return Encoding.UTF8; 
} 


// The next check is a heuristic attempt to detect UTF-16 without a BOM. 
// We simply look for zeroes in odd or even byte places, and if a certain 
// threshold is reached, the code is 'probably' UTF-16.
double threshold = 0.1; // proportion of chars step 2 which must be zeroed to be diagnosed as utf-16. 0.1 = 10% 
int count = 0; 
for (int n = 0; n < taster; n += 2) if (b[n] == 0) count++; 
if (((double)count)/taster > threshold) { text = Encoding.BigEndianUnicode.GetString(b); return Encoding.BigEndianUnicode; } 
count = 0; 
for (int n = 1; n < taster; n += 2) if (b[n] == 0) count++; 
if (((double)count)/taster > threshold) { text = Encoding.Unicode.GetString(b); return Encoding.Unicode; } // (little-endian) 


// Finally, a long shot - let's see if we can find "charset=xyz" or 
// "encoding=xyz" to identify the encoding: 
for (int n = 0; n < taster-9; n++) 
{ 
    if (
     ((b[n + 0] == 'c' || b[n + 0] == 'C') && (b[n + 1] == 'h' || b[n + 1] == 'H') && (b[n + 2] == 'a' || b[n + 2] == 'A') && (b[n + 3] == 'r' || b[n + 3] == 'R') && (b[n + 4] == 's' || b[n + 4] == 'S') && (b[n + 5] == 'e' || b[n + 5] == 'E') && (b[n + 6] == 't' || b[n + 6] == 'T') && (b[n + 7] == '=')) || 
     ((b[n + 0] == 'e' || b[n + 0] == 'E') && (b[n + 1] == 'n' || b[n + 1] == 'N') && (b[n + 2] == 'c' || b[n + 2] == 'C') && (b[n + 3] == 'o' || b[n + 3] == 'O') && (b[n + 4] == 'd' || b[n + 4] == 'D') && (b[n + 5] == 'i' || b[n + 5] == 'I') && (b[n + 6] == 'n' || b[n + 6] == 'N') && (b[n + 7] == 'g' || b[n + 7] == 'G') && (b[n + 8] == '=')) 
     ) 
    { 
     if (b[n + 0] == 'c' || b[n + 0] == 'C') n += 8; else n += 9; 
     if (b[n] == '"' || b[n] == '\'') n++; 
     int oldn = n; 
     while (n < taster && (b[n] == '_' || b[n] == '-' || (b[n] >= '0' && b[n] <= '9') || (b[n] >= 'a' && b[n] <= 'z') || (b[n] >= 'A' && b[n] <= 'Z'))) 
     { n++; } 
     byte[] nb = new byte[n-oldn]; 
     Array.Copy(b, oldn, nb, 0, n-oldn); 
     try { 
      string internalEnc = Encoding.ASCII.GetString(nb); 
      text = Encoding.GetEncoding(internalEnc).GetString(b); 
      return Encoding.GetEncoding(internalEnc); 
     } 
     catch { break; } // If C# doesn't recognize the name of the encoding, break. 
    } 
} 


// If all else fails, the encoding is probably (though certainly not 
// definitely) the user's local codepage! One might present to the user a 
// list of alternative encodings as shown here: https://stackoverflow.com/questions/8509339/what-is-the-most-common-encoding-of-each-language 
// A full list can be found using Encoding.GetEncodings(); 
text = Encoding.Default.GetString(b); 
return Encoding.Default; 

}

2

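You are probably running into this problem because of the StringReader. Something has to convert the raw bytes into a string. Unless you did something special before that point, .NET does it for you, and it generally uses the computer's default settings.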

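The problem with the 8-bit era is that there are dozens (if not more) of interpretations of the 8th bit, and there is no real way to tell from the bytes alone which one was used. If you decode as ASCII, anything with the 8th bit set is converted to ASCII code 63, `?`. If you decode as UTF-8, a byte with the 8th bit set is treated as the start of a multi-byte sequence and the decoder tries to read the continuation bytes that follow (one to three under the current UTF-8 definition; see Wikipedia for more info). If that does not work, the byte is converted to U+FFFD (65533), �, which is what you are seeing. If you specify the encoding manually, such as the koi8-r you were given, the 8th bit is interpreted correctly.

Below is sample code showing this. Instead of dumping to the Console I use a message box, but you can switch as long as you remember to change your console's encoding.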

var bytes = new byte[] { 226 }; 
var s1 = System.Text.Encoding.ASCII.GetString(bytes); // Invalid: becomes "?" (ASCII 63)
var s2 = System.Text.Encoding.UTF8.GetString(bytes); // Invalid: becomes U+FFFD, the replacement character
var s3 = System.Text.Encoding.GetEncoding("koi8-r").GetString(bytes); // Б

MessageBox.Show(String.Format("{0} {1} {2}", s1, s2, s3)); 
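
The loss described above can also be checked without a message box. The following is only a sketch: `RoundTripSketch` and `SurvivesRoundTrip` are hypothetical names, not part of .NET. It decodes bytes with a given encoding, encodes the result again, and reports whether the original bytes came back.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of the .NET class library.
public static class RoundTripSketch
{
    // Returns true when decoding and re-encoding with 'encoding'
    // reproduces the original bytes exactly.
    public static bool SurvivesRoundTrip(byte[] original, Encoding encoding)
    {
        string decoded = encoding.GetString(original);
        byte[] reencoded = encoding.GetBytes(decoded);

        if (reencoded.Length != original.Length) return false;
        for (int i = 0; i < original.Length; i++)
        {
            if (reencoded[i] != original[i]) return false;
        }
        return true;
    }
}
```

With the single byte 226 from the sample, `SurvivesRoundTrip` returns false for `Encoding.ASCII` and `Encoding.UTF8`, because the byte has already been replaced by `?` or U+FFFD. It returns true for a single-byte encoding such as `Encoding.GetEncoding("iso-8859-1")`, which maps every byte to exactly one character.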

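In short, if you are getting the UTF-8 replacement character (and you are), it means you have already lost the original value of those bytes, and you need to fix this earlier in the process. Whatever converts the bytes to a string has to take `Content-Type: text/plain; charset="koi8-r";` into account; it cannot be done after the fact.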
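
One way to act on this is sketched below. It assumes a single-part text message whose Content-Transfer-Encoding is 7bit or 8bit, so the body bytes need no further decoding. `EmlTextSketch` and `DecodeBody` are hypothetical names, and falling back to Latin-1 when no usable `charset=` is found is an assumption made for this sketch. The headers are located using Latin-1 because it maps every byte to exactly one character, so string offsets equal byte offsets.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not taken from the question's code.
public static class EmlTextSketch
{
    // Decodes the body of a single-part text message from the raw .eml bytes,
    // using the charset named in the headers.
    public static string DecodeBody(byte[] raw)
    {
        // Latin-1 is a 1:1 byte-to-char mapping, so char index == byte index.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string probe = latin1.GetString(raw);

        // The headers end at the first blank line (CRLF or bare LF).
        int split = probe.IndexOf("\r\n\r\n", StringComparison.Ordinal);
        int separatorLength = 4;
        if (split < 0)
        {
            split = probe.IndexOf("\n\n", StringComparison.Ordinal);
            separatorLength = 2;
        }
        if (split < 0) return probe; // no header/body separator found

        string headers = probe.Substring(0, split);

        // Look for charset=name or charset="name" anywhere in the headers.
        Match m = Regex.Match(headers, @"charset\s*=\s*""?([A-Za-z0-9_\-]+)",
                              RegexOptions.IgnoreCase);

        Encoding bodyEncoding = latin1; // assumed fallback for this sketch
        if (m.Success)
        {
            try { bodyEncoding = Encoding.GetEncoding(m.Groups[1].Value); }
            catch (ArgumentException) { /* unknown charset name: keep fallback */ }
            catch (NotSupportedException) { /* unavailable on this platform */ }
        }

        int bodyStart = split + separatorLength;
        return bodyEncoding.GetString(raw, bodyStart, raw.Length - bodyStart);
    }
}
```

A call would look like `string body = EmlTextSketch.DecodeBody(File.ReadAllBytes("koi8-r.eml"));`. On .NET Core and .NET 5 or later, code-page encodings such as koi8-r are only available after calling `Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)`; on .NET Framework they are built in.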