2012-06-14
9

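Character encoding with Ruby 1.9.3 and the mail gem

I'm trying to parse an email string with the Ruby mail gem, and I'm having a devil of a time with character encodings. Take the following email: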

MIME-Version: 1.0 
Sender: [email protected] 
Received: by 10.142.239.17 with HTTP; Thu, 14 Jun 2012 06:00:18 -0700 (PDT) 
Date: Thu, 14 Jun 2012 09:00:18 -0400 
Delivered-To: [email protected] 
X-Google-Sender-Auth: MxfFrMybNjBoBt4O4GwAn9cMsko 
Message-ID: <[email protected]om> 
Subject: Re: [Lorem Ipsum] Foo updated the forum topic 'Reply by email test' 
From: Foo Bar <[email protected]> 
To: Foo <[email protected]> 
Content-Type: text/plain; charset=ISO-8859-1 
Content-Transfer-Encoding: quoted-printable 

This reply has accents:=A0R=E9sum=E9 
> 
> --------- Reply Above This Line ------------ 
> 
> Email parsing with accents: R=E9sum=E9 
> 
> Click here to view this post in your browser 

The email body, when decoded correctly, should be:

This reply has accents: Résumé 
> 
> --------- Reply Above This Line ------------ 
> 
> Email parsing with accents: Résumé 
> 
> Click here to view this post in your browser 

However, I'm having a devil of a time actually getting the accent marks to come through. Here's what I've tried:

message = Mail.new(email_string) 
body = message.body.decoded 

That gives me a string that starts like this:

This reply has accents:\xA0R\xE9sum\xE9\r\n>\r\n> --------- Reply Above This Line ------------ 

Finally, I tried this:

body.encoding # => <Encoding:ASCII-8BIT> 
body.encode("UTF-8") # => Encoding::UndefinedConversionError: "\xA0" from ASCII-8BIT to UTF-8 

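Does anyone have suggestions on how to handle this? I'm pretty sure it has to do with the "charset=ISO-8859-1" setting in the email, but I'm not sure how to use that setting, or whether the mail gem offers an easy way to extract it.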

Answers

17

After playing around a bit, I found this:

message.body.decoded.force_encoding("ISO-8859-1").encode("UTF-8") # => "This reply has accents: Résumé..." 
message.parts.map { |part| part.decoded.force_encoding(part.charset).encode("UTF-8") } # multipart: label each part with its own charset
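
To show why the `force_encoding` step matters, here is a minimal sketch that needs only core Ruby, not the mail gem. It assumes the body has already been pulled out as a quoted-printable string (the sample line from the question) and uses `String#unpack("M")` as a stand-in for the gem's decoding; the variable names are illustrative.

```ruby
# Quoted-printable text, as in the first body line of the question's email.
raw = "This reply has accents:=A0R=E9sum=E9"

# unpack("M") decodes quoted-printable; we label the result as raw bytes.
bytes = raw.unpack("M").first.force_encoding("ASCII-8BIT")

# Transcoding straight from binary fails: Ruby has no mapping for byte 0xA0.
direct_error = begin
  bytes.encode("UTF-8")
  nil
rescue Encoding::UndefinedConversionError => e
  e
end

# force_encoding only relabels the bytes; encode then actually converts them.
utf8 = bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")

puts direct_error.class
puts utf8
```

`force_encoding` never changes the bytes, it only tells Ruby how to interpret them, which is why it has to come before `encode`.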

You can extract the charset from the message like so:

message.charset #=> for simple, non-multipart 
message.parts.map { |part| part.charset } #=> for multipart, each part can have its own charset 

Be careful with non-multipart messages, as the following will cause trouble:

message.body.charset #=> returns "US-ASCII" which is WRONG! 
message.body.decoded.force_encoding(message.body.charset).encode("UTF-8") #=> Conversion error... 

message.body.decoded.force_encoding(message.charset).encode("UTF-8") #=> Correct conversion :) 
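
The pitfall can be reproduced without the mail gem. This is a small sketch, assuming the decoded body bytes are the ISO-8859-1 encoding of "Résumé"; the variable names are made up for the illustration.

```ruby
# Raw ISO-8859-1 bytes for "Résumé", labelled as binary.
bytes = "R\xE9sum\xE9".b

# Labelling the bytes with the wrong charset (US-ASCII) makes the conversion
# fail, because 0xE9 is not a valid US-ASCII byte.
wrong_error = begin
  bytes.dup.force_encoding("US-ASCII").encode("UTF-8")
  nil
rescue EncodingError => e
  e
end

# Labelling them with the charset the message declares converts cleanly.
right = bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")

puts wrong_error.class
puts right
```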
+1

Awesome. I've been looking for this. I ended up doing: body = message.text_part.encode('UTF-8', message.text_part.charset, :invalid => :replace, :undef => :replace)

+0

Awesome... thanks a ton... – Jyothu

+0

Some parts don't seem to have a charset. I'm not sure how to handle those.

0

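This didn't work for me, so I kept at it, and I'm posting the solution I ended up with in case it helps anyone...

Basically, I had to add encoding defaults and massage the output into a sane string. https://stackoverflow.com/a/26604049/2386548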
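
One way to read "encoding defaults", and to cope with parts that declare no charset, is a small fallback helper. The sketch below is not the code from the linked answer and not part of the mail gem: `to_utf8` is a hypothetical name, and the ISO-8859-1 default is an assumption you may want to change for your own mail.

```ruby
# Hypothetical helper (not part of the mail gem): convert raw body bytes to
# UTF-8, falling back to a default charset when none is declared or when Ruby
# does not recognise the declared name.
def to_utf8(bytes, charset, fallback = "ISO-8859-1")
  label = charset.to_s.strip
  label = fallback if label.empty?

  encoding = begin
    Encoding.find(label)
  rescue ArgumentError
    Encoding.find(fallback)
  end

  # Relabel a copy of the bytes, then transcode, replacing anything that
  # cannot be converted instead of raising.
  bytes.dup
       .force_encoding(encoding)
       .encode("UTF-8", invalid: :replace, undef: :replace)
end

raw = "R\xE9sum\xE9".b
puts to_utf8(raw, nil)          # no charset declared: the fallback is used
puts to_utf8(raw, "ISO-8859-1") # charset declared: used as-is
```

With the mail gem, the arguments would presumably be a part's decoded body and its `charset`, as in the accepted answer. A fallback is a guess by nature, so text in some other encoding will still come out garbled rather than raise.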