
I have been looking at a piece of text that represents HTML encoded as a combination of Windows-1252 and quoted-printable:

<html>\r\n<head>\r\n<meta http-equiv=3D\"Content-Type\" content=3D\"text/html; charset=3DWindows-1=\r\n252\">\r\n<style type=3D\"text/css\" style=3D\"display:none;\"><!-- P {margin-top:0;margi=\r\nn-bottom:0;} --></style>\r\n</head>\r\n<body dir=3D\"ltr\">This should be a pound sign: =A3 and this should be a long dash: =96 \r\n</body>\r\n</html>\r\n

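From the HTML <meta> tag I can see that this piece of HTML should be encoded as Windows-1252.

I am using node.js to parse this piece of text with cheerio. However, decoding it with https://github.com/mathiasbynens/windows-1252 does not help: windows1252.decode(myString); returns the same string it was given.

I think the reason is that the input string is already encoded in the standard node.js charset, while it actually represents a windows-1252-encoded piece of HTML (if that makes sense?).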

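Looking at those odd hex numbers prefixed by =, I can see valid windows-1252 codes, for example:

=\r\n and \r\n should somehow represent carriage returns in the Windows world,
=3D: HEX 3D is DEC 61, which is an equals sign: =,
=96: HEX 96 is DEC 150, which is an "en dash" sign: – (a sort of "long minus sign"),
=A3: HEX A3 is DEC 163, which is a pound sign: £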
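
The hex-to-decimal arithmetic in the list above can be checked with a short sketch. `escapeToByte` is a hypothetical helper name used only for illustration, not part of any library mentioned here.

```javascript
// Hypothetical helper: turn a "=XX" escape into its numeric byte value.
function escapeToByte(escape) {
  const match = /^=([0-9A-F]{2})$/i.exec(escape);
  if (!match) {
    throw new Error(`Not a =XX escape: ${escape}`);
  }
  return parseInt(match[1], 16);
}

const bytes = ['=3D', '=96', '=A3'].map(escapeToByte);
console.log(bytes); // [ 61, 150, 163 ]
console.log(String.fromCharCode(bytes[0])); // "="
```

Only the numeric values are computed here; which character a value such as 150 stands for depends on the charset used to interpret it.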

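I have no control over how this piece of HTML is generated, but I am supposed to parse it and clean it up, producing £ (instead of =A3) and so on.

Now, I know I could keep an in-memory map of the conversions, but I was wondering whether there is already a programmatic solution that covers the entire windows-1252 charset?

See this for the whole conversion table: https://www.w3schools.com/charsets/ref_html_ansi.asp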

Edit:

The input HTML comes from an IMAP session, so it seems a 7bit/8bit "quoted-printable encoding" is applied upstream where I cannot control it (see https://en.wikipedia.org/wiki/Quoted-printable).

In the meantime I became aware of this extra encoding and tried the quoted-printable library (see https://github.com/mathiasbynens/quoted-printable), with no luck.
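
To make the quoted-printable layer concrete, here is a minimal sketch of what such a decoder does, using only Node built-ins. It is a simplification (it ignores the trailing-whitespace rules, for instance) and `decodeQuotedPrintableToBytes` is a hypothetical name, not the API of the library linked above. It returns bytes rather than text, since quoted-printable says nothing about the charset.

```javascript
// Hypothetical helper: decode quoted-printable text into raw bytes.
function decodeQuotedPrintableToBytes(input) {
  // "=" directly followed by a line break is a soft line break: drop it.
  const withoutSoftBreaks = input.replace(/=\r?\n/g, '');
  const out = [];
  for (let i = 0; i < withoutSoftBreaks.length; i++) {
    const ch = withoutSoftBreaks[i];
    const hex = withoutSoftBreaks.slice(i + 1, i + 3);
    if (ch === '=' && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      out.push(parseInt(hex, 16)); // "=XX" stands for one byte
      i += 2;
    } else {
      out.push(ch.charCodeAt(0) & 0xff); // literal ASCII character
    }
  }
  return Buffer.from(out);
}

const qpBytes = decodeQuotedPrintableToBytes('charset=3DWindows-1=\r\n252 =A3 =96');
console.log(qpBytes.subarray(-3)); // <Buffer a3 20 96>
```

The result still has to be interpreted with a charset (Windows-1252 in this question) before it becomes readable text.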

Here is an MCV (as requested):

var cheerio = require('cheerio'); 
var windows1252 = require('windows-1252'); 
var quotedPrintable = require('quoted-printable'); 

const inputString = '<html>\r\n<head>\r\n<meta http-equiv=3D\"Content-Type\" content=3D\"text/html; charset=3DWindows-1=\r\n252\">\r\n<style type=3D\"text/css\" style=3D\"display:none;\"><!-- P {margin-top:0;margi=\r\nn-bottom:0;} --></style>\r\n</head>\r\n<body dir=3D\"ltr\">This should be a pound sign: =A3 and this should be a long dash: =96 \r\n</body>\r\n</html>\r\n' 
const $ = cheerio.load(inputString, {decodeEntities: true}); 
const bodyContent = $('html body').text().trim(); 
const decodedBodyContent = windows1252.decode(bodyContent); 

console.log(`The input string: "${bodyContent}"`); 
console.log(`The output string: "${decodedBodyContent}"`); 

if (bodyContent === decodedBodyContent) { 
    console.log('The windows1252 output seems the same of as the input'); 
} 

const decodedQp = quotedPrintable.decode(bodyContent) 
console.log(`The decoded QP string: "${decodedQp}"`);

The script above produces the following output:

The input string: "This should be a pound sign: =A3 and this should be a long dash: =96" 
The output string: "This should be a pound sign: =A3 and this should be a long dash: =96" 
The windows1252 output seems the same of as the input 
The decoded QP string: "This should be a pound sign: £ and this should be a long dash: "

On my command line I cannot see the long dash, and I do not know how to correctly decode all of these =<something> encoded characters.
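
One possible explanation for the missing dash, assuming the quoted-printable step yields a "byte string" (one character per decoded byte): byte A3 lands on U+00A3, which happens to be £ in Unicode, while byte 96 lands on U+0096, a C1 control character that terminals usually do not draw. The sketch below only inspects character codes; `describeByteChar` is a hypothetical helper.

```javascript
// Hypothetical helper: classify one character of a "byte string".
function describeByteChar(ch) {
  const code = ch.charCodeAt(0);
  return {
    hex: code.toString(16).toUpperCase(),
    // U+0080..U+009F are C1 control characters in Unicode.
    isC1Control: code >= 0x80 && code <= 0x9f,
  };
}

console.log(describeByteChar('\xA3')); // { hex: 'A3', isC1Control: false }
console.log(describeByteChar('\x96')); // { hex: '96', isC1Control: true }
```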

Source

2017-10-17 TPPZ

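It looks like you are very much out of luck here. – awd

I think you need to provide a more complete [mcve]. For a start, how does the text get into your program? – Quentin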

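It seems that the message received via IMAP comes with a combination of 2 different encodings:

the actual string is encoded according to the "quoted-printable" encoding (https://en.wikipedia.org/wiki/Quoted-printable), because I think there is an issue with the 7bit/8bit mapping when the information is transmitted over the IMAP channel (a TCP socket connection)
the logical representation of the content (the email body) is HTML with a <meta> tag declaring the Windows-1252 charset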
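
The second layer can be sketched without any package. Windows-1252 differs from ISO-8859-1 only in bytes 0x80 to 0x9F, so a 32-entry table is enough; the values below follow the WHATWG Encoding Standard index. `decodeWindows1252` is a hypothetical helper written for illustration, not the windows-1252 package's API. Node's built-in TextDecoder may also accept 'windows-1252' on builds with full ICU, but that is not relied on here.

```javascript
// Code points for bytes 0x80..0x9F in Windows-1252; every other byte
// maps to the Unicode code point with the same numeric value.
const CP1252_HIGH = [
  0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
  0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008d, 0x017d, 0x008f,
  0x0090, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
  0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0x009d, 0x017e, 0x0178,
];

// Hypothetical helper: decode a Buffer/Uint8Array of Windows-1252 bytes.
function decodeWindows1252(bytes) {
  let text = '';
  for (const byte of bytes) {
    const inHighRange = byte >= 0x80 && byte <= 0x9f;
    const codePoint = inHighRange ? CP1252_HIGH[byte - 0x80] : byte;
    text += String.fromCharCode(codePoint);
  }
  return text;
}

const sample = Buffer.from([0xa3, 0x20, 0x96]); // the bytes behind "=A3 =96"
console.log(decodeWindows1252(sample)); // "£ –"
```

The input here is the byte sequence that quoted-printable decoding of "=A3 =96" would yield, so the two layers are undone in the order the list describes.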

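There is also an "issue" with these HTML chunks containing a lot of Windows-style carriage returns (\r\n). In my case I had to pre-process the string to deal with this by removing those carriage returns.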
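
A caveat on that pre-processing, offered as a sketch and not as a correction of the approach: in quoted-printable, only an "=" followed by \r\n is a soft line break; a bare \r\n is a real line break in the content. Stripping every \r\n before decoding can leave a stray "=" behind, as this small comparison suggests (the sample string is taken from the <style> part of the question's HTML, with a made-up second line).

```javascript
// "margi=\r\nn-bottom" contains a soft line break; the later \r\n is a real one.
const raw = 'margi=\r\nn-bottom:0;\r\nnext line';

// Removing every \r\n leaves the "=" of the soft break in the text.
const stripAll = raw.replace(/\r\n/g, '');

// Removing only soft line breaks rejoins the word and keeps real breaks.
const stripSoft = raw.replace(/=\r\n/g, '');

console.log(JSON.stringify(stripAll));  // "margi=n-bottom:0;next line"
console.log(JSON.stringify(stripSoft)); // "margin-bottom:0;\r\nnext line"
```

The example string in the answer contains no soft line breaks, which may be why removing all carriage returns works there.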

The following MCV example should show the process of cleaning and validating the string that represents the email body content:

const quotedPrintable = require('quoted-printable');
const windows1252 = require('windows-1252');

const inputStr = 'This should be a pound sign: =A3 \r\nand this should be a long dash: =96\r\n';
console.log(`The original string: "${inputStr}"`);

// 1. clean the "Windows carriage returns" (\r\n)
const cleanedStr = inputStr.replace(/\r\n/g, '');
console.log(`The string without carriage returns: "${cleanedStr}"`);

// 2. decode using the "quoted printable protocol"
const decodedQp = quotedPrintable.decode(cleanedStr);
console.log(`The decoded QP string: "${decodedQp}"`);

// 3. decode using the "windows-1252"
const windows1252DecodedQp = windows1252.decode(decodedQp);
console.log(`The windows1252 decoded QP string: "${windows1252DecodedQp}"`);

Which gives this output:

The original string: "This should be a pound sign: =A3 
and this should be a long dash: =96 
" 
The string without carriage returns: "This should be a pound sign: =A3 and this should be a long dash: =96" 
The decoded QP string: "This should be a pound sign: £ and this should be a long dash: " 
The windows1252 decoded QP string: "This should be a pound sign: £ and this should be a long dash: –"

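Notice that the "long dash" is rendered differently before and after the Windows-1252 decoding stage.

Afaik, this has nothing to do with UTF-8 encoding/decoding. I was able to figure out the "decoding order" of this process from: https://github.com/mathiasbynens/quoted-printable/issues/5

One thing I am not sure about is whether the operating system I am running this code on has some kind of effect on the charset/encoding of files or string streams.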

The npm packages I used are quoted-printable and windows-1252.

Source

2017-10-18 12:40:01 TPPZ
