Converting Windows-1252 to UTF-8 in Java: null characters with CharsetDecoder/Encoder

I know this is a very common question, but it is driving me crazy.

I was using this code:

String ucs2Content = new String(bufferToConvert, inputEncoding);
byte[] outputBuf = ucs2Content.getBytes(outputEncoding);
return outputBuf;

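But I read that it is better to use CharsetDecoder and CharsetEncoder (my content may contain some characters that fall outside the destination encoding). I just wrote this code, but it also has a problem: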

// Create the encoder and decoder for Win1252 
Charset charsetInput = Charset.forName(inputEncoding); 
CharsetDecoder decoder = charsetInput.newDecoder(); 

Charset charsetOutput = Charset.forName(outputEncoding); 
CharsetEncoder encoder = charsetOutput.newEncoder(); 

// Convert the byte array from starting inputEncoding into UCS2 
CharBuffer cbuf = decoder.decode(ByteBuffer.wrap(bufferToConvert)); 

// Convert the internal UCS2 representation into outputEncoding 
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(cbuf)); 
return bbuf.array();

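In fact, this code appends a sequence of null characters to the buffer!

Can anyone tell me where the problem is? I am not very familiar with encoding conversion in Java.

Is there a better way to convert encodings in Java?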

2011-05-25 robob

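Your problem is that ByteBuffer.array() returns a direct reference to the array used as the backing store of the ByteBuffer, not a copy of the valid range of the backing array. You have to obey bbuf.limit() (as Peter did in his reply) and only use the array content from index 0 to bbuf.limit()-1.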
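One way to honor bbuf.limit() is to copy only the valid range out of the buffer instead of returning bbuf.array(). The following is a minimal sketch, not the original poster's code: the class name ValidRangeCopy and the helper convert are hypothetical, and wrapping the checked exception in UncheckedIOException is just one possible choice.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

class ValidRangeCopy {
    // Hypothetical helper: decodes input from inputEncoding, encodes it to
    // outputEncoding, and returns an array holding only the bytes actually written.
    static byte[] convert(byte[] input, String inputEncoding, String outputEncoding) {
        try {
            CharBuffer chars = Charset.forName(inputEncoding).newDecoder()
                    .decode(ByteBuffer.wrap(input));
            ByteBuffer bbuf = Charset.forName(outputEncoding).newEncoder().encode(chars);

            // remaining() is limit - position, i.e. the number of valid bytes.
            byte[] out = new byte[bbuf.remaining()];
            bbuf.get(out); // copies position..limit-1, ignoring unused backing-array slack
            return out;
        } catch (CharacterCodingException e) {
            // Thrown for malformed input or characters the target charset cannot represent.
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling ValidRangeCopy.convert(bufferToConvert, "windows-1252", "UTF-8") would then return an array with no trailing zero bytes, whatever size the encoder chose for its internal buffer.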

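The reason for the extra 0 values in the backing array is a slight flaw in how the CharsetEncoder creates the resulting ByteBuffer. Each CharsetEncoder has an "average bytes per character" value, which for the UCS2 encoder seems simple and correct (2 bytes/char). Obeying this fixed value, the CharsetEncoder initially allocates a ByteBuffer of "string length * average bytes per character" bytes, in this case e.g. 20 bytes for a string 10 characters long. The UCS2 CharsetEncoder, however, starts with a BOM (byte order mark), which also occupies 2 bytes, so only 9 of the 10 characters fit in the allocated ByteBuffer. The CharsetEncoder detects the overflow and allocates a new ByteBuffer with a length of 2*n+1 (n being the original length of the ByteBuffer), in this case 2*20+1 = 41 bytes. Since only 2 of the 21 new bytes are needed to encode the remaining character, the array you get from bbuf.array() will have a length of 41 bytes, but bbuf.limit() will indicate that only the first 22 entries are actually used.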
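The gap between the backing array and the valid range can be observed directly. The sketch below assumes Java's "UTF-16" charset as a stand-in for the "UCS2" encoder mentioned above (it also writes a BOM first); the class name BackingArrayDemo and the helper encodeUtf16 are hypothetical. The backing-array length depends on the JDK's internal allocation strategy, so it is printed rather than assumed.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class BackingArrayDemo {
    // Encodes text as UTF-16 (BOM first) and returns the encoder's ByteBuffer as-is.
    static ByteBuffer encodeUtf16(String text) {
        CharsetEncoder encoder = StandardCharsets.UTF_16.newEncoder();
        try {
            return encoder.encode(CharBuffer.wrap(text));
        } catch (CharacterCodingException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CharsetEncoder encoder = StandardCharsets.UTF_16.newEncoder();
        ByteBuffer bbuf = encodeUtf16("0123456789"); // 10 characters

        // The encoder's sizing hint for its initial output buffer.
        System.out.println("averageBytesPerChar = " + encoder.averageBytesPerChar());
        // Valid bytes: 2 for the BOM plus 2 per character.
        System.out.println("limit               = " + bbuf.limit());
        // Size of the whole backing array, which may be larger than limit.
        System.out.println("array().length      = " + bbuf.array().length);
    }
}
```

Whenever array().length is larger than limit, the surplus entries are the zero bytes that show up when the whole backing array is returned.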

2011-05-26 09:50:15 jarnbjo

Thanks, you probably just saved me several hours of frustration – pepsi 2011-08-11 19:29:58

I don't know how you get a sequence of null characters. Try this:

String outputEncoding = "UTF-8"; 
Charset charsetOutput = Charset.forName(outputEncoding); 
CharsetEncoder encoder = charsetOutput.newEncoder(); 

// Convert the byte array from starting inputEncoding into UCS2 
byte[] bufferToConvert = "Hello World! £€".getBytes(); 
CharBuffer cbuf = decoder.decode(ByteBuffer.wrap(bufferToConvert)); 

// Convert the internal UCS2 representation into outputEncoding 
ByteBuffer bbuf = encoder.encode(CharBuffer.wrap(cbuf)); 
System.out.println(new String(bbuf.array(), 0, bbuf.limit(), charsetOutput));

prints

Hello World! £€
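The snippet above uses a decoder variable without declaring it. A self-contained variant might look like the sketch below, assuming the input bytes are Windows-1252 and that the JDK provides the "windows-1252" charset (it is not one of the charsets every Java platform must support). The class name Cp1252ToUtf8 and the method roundTrip are hypothetical, and the non-ASCII characters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class Cp1252ToUtf8 {
    // Decodes Windows-1252 bytes, re-encodes them as UTF-8, and reads the
    // UTF-8 bytes back into a String using only the valid part of the buffer.
    static String roundTrip(byte[] cp1252Bytes) {
        Charset charsetInput = Charset.forName("windows-1252");
        Charset charsetOutput = StandardCharsets.UTF_8;
        CharsetDecoder decoder = charsetInput.newDecoder();
        CharsetEncoder encoder = charsetOutput.newEncoder();
        try {
            CharBuffer cbuf = decoder.decode(ByteBuffer.wrap(cp1252Bytes));
            ByteBuffer bbuf = encoder.encode(cbuf);
            // Start at arrayOffset + position and read exactly remaining() bytes.
            return new String(bbuf.array(), bbuf.arrayOffset() + bbuf.position(),
                    bbuf.remaining(), charsetOutput);
        } catch (CharacterCodingException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "Hello World! " followed by the pound sign and the euro sign.
        String text = "Hello World! \u00A3\u20AC";
        // Encode explicitly instead of relying on the platform default charset.
        byte[] bufferToConvert = text.getBytes(Charset.forName("windows-1252"));
        System.out.println(roundTrip(bufferToConvert));
    }
}
```

Passing an explicit charset to getBytes avoids depending on the platform default, which the no-argument getBytes() would use.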

2011-05-25 16:37:17

But you have to declare a CharsetDecoder for the input encoding. Something like: CharsetDecoder decoder = charsetInput.newDecoder() where charsetInput = Charset.forName("cp1252"). – robob 2011-05-25 17:37:14

There is also a similar question: http://stackoverflow.com/questions/1252468/java-converting-string-to-and-from-bytebuffer-and-associated-problems – robob 2011-05-25 17:44:06

Do you think the null sequence could be caused by the missing "decoder.flush" and "encoder.flush"? I also see that you don't use flush() in your code... – robob 2011-05-25 18:06:21
