StreamDecoder vs InputStreamReader when reading malformed files

I came across some strange behavior when reading files in Java 8, and I'm wondering if someone can make sense of it.

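Scenario:

Reading a malformed text file. By malformed, I mean that it contains bytes that do not map to any Unicode code point.

The code I use to create such a file is as follows: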

byte[] text = new byte[1]; 
char k = (char) -60; 
text[0] = (byte) k; 
FileUtils.writeByteArrayToFile(new File("/tmp/malformed.log"), text);

This code generates a file that contains exactly one byte, which is not part of the ASCII table (nor the extended one).
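
For anyone who wants to reproduce this without an extra dependency (FileUtils above looks like the Apache Commons IO class, which is an assumption here), the same step can be sketched with the JDK alone. The class name, the method names, and the use of a temp file instead of /tmp/malformed.log are illustrative choices, not part of the original code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not from the original post): JDK-only version of the file-creation step.
final class MalformedFileWriter {

    // (char) -60 is 0xFFC4; narrowing it to byte keeps only the low 8 bits, i.e. 0xC4.
    static byte malformedByte() {
        char k = (char) -60;
        return (byte) k;
    }

    // Writes the single byte to a fresh temp file, reads it back, and deletes the file.
    static byte[] writeAndReadBack() {
        try {
            Path file = Files.createTempFile("malformed", ".log");
            try {
                Files.write(file, new byte[] { malformedByte() });
                return Files.readAllBytes(file);
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] content = writeAndReadBack();
        // Print the size and the byte value in hex so the content can be checked by eye.
        System.out.printf("%d byte(s), first = 0x%02X%n", content.length, content[0] & 0xFF);
    }
}
```

The byte that ends up in the file is 0xC4, which in UTF-8 is the lead byte of a two-byte sequence.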

Trying to cat this file produces the following output:

�

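Which is the Unicode Replacement Character. This makes sense, because UTF-8 needs at least 2 bytes to decode a non-ASCII character, but we only have one. This is the behavior I expect from my Java code as well.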
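
The same result can be checked from Java without any file or reader. This is only a small sketch: it relies on new String(byte[], Charset) being documented to always replace malformed input with the charset's default replacement string, and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class (not from the original post).
final class LoneLeadByteDemo {

    // new String(byte[], Charset) never throws on bad input; it substitutes
    // the charset's replacement string (U+FFFD for UTF-8).
    static String decodeLeniently(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xC4 alone: the lead byte of a two-byte sequence with its continuation byte missing.
        String truncated = decodeLeniently(new byte[] { (byte) 0xC4 });
        // 0xC4 0x80: a complete two-byte sequence, which decodes to U+0100.
        String complete = decodeLeniently(new byte[] { (byte) 0xC4, (byte) 0x80 });

        System.out.printf("truncated: U+%04X%n", (int) truncated.charAt(0));
        System.out.printf("complete:  U+%04X%n", (int) complete.charAt(0));
    }
}
```

So the lone byte is not an "unknown character" in itself; it is the first half of a sequence whose second half never arrives.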

Pasting some common code:

private void read(Reader reader) throws IOException { 

    CharBuffer buffer = CharBuffer.allocate(8910); 

    buffer.flip(); 

    // move existing data to the front of the buffer 
    buffer.compact(); 

    // pull in as much data as we can from the socket 
    int charsRead = reader.read(buffer); 

    // flip so the data can be consumed 
    buffer.flip(); 

    ByteBuffer encode = Charset.forName("UTF-8").encode(buffer); 
    byte[] body = new byte[encode.remaining()]; 
    encode.get(body); 

    System.out.println(new String(body)); 
}

This is my first approach, using nio:

FileInputStream inputStream = new FileInputStream(new File("/tmp/malformed.log")); 
read(Channels.newReader(inputStream.getChannel(), "UTF-8"));

This produces the following exception:

java.nio.charset.MalformedInputException: Input length = 1 

    at java.nio.charset.CoderResult.throwException(CoderResult.java:281) 
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339) 
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) 
    at java.io.Reader.read(Reader.java:100)

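This is not what I expected, but it also kind of makes sense, because this really is a corrupt and illegal file, and the exception is basically telling us that it expected more bytes to be read.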
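
The exception can also be reproduced in isolation. The sketch below assumes nothing beyond the JDK: a decoder obtained from Charset.newDecoder() reports malformed input by default, and the one-shot CharsetDecoder.decode(ByteBuffer) method treats bytes left over at end of input as malformed. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class (not from the original post).
final class StrictDecodeDemo {

    // Returns the length of the malformed input reported by the decoder,
    // or -1 if the bytes decoded cleanly.
    static int malformedLength(byte[] bytes) {
        try {
            // A fresh decoder uses CodingErrorAction.REPORT for malformed input.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return -1;
        } catch (MalformedInputException e) {
            return e.getInputLength();
        } catch (CharacterCodingException e) {
            throw new IllegalStateException("unexpected coding error", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(malformedLength(new byte[] { (byte) 0xC4 })); // truncated sequence
        System.out.println(malformedLength(new byte[] { 'A' }));         // plain ASCII
    }
}
```

The "Input length = 1" in the stack trace above corresponds to the value returned by getInputLength() here.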

And my second approach (using regular java.io):

FileInputStream inputStream = new FileInputStream(new File("/tmp/malformed.log")); 
read(new InputStreamReader(inputStream, "UTF-8"));

This does not fail, and it produces exactly the same output as cat did:

�

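Which also makes sense.

So my questions are:

What is the expected behavior from a Java application in this scenario?
Why is there a difference between using Channels.newReader (which returns a StreamDecoder) and simply using a regular InputStreamReader? Am I doing something wrong in how I read?

Any clarification would be much appreciated.

Thanks :)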

Source

2017-08-01 Eli Polonsky

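Did you notice that you didn't specify 'UTF-8' for the 'InputStreamReader'? Is your platform default encoding 'UTF-8' or something else? 'InputStreamReader' also uses 'StreamDecoder' internally. – Kayaman

"the extended one": Which extended one? IBM437 uses all 256 byte values in any sequence. Anyway, do you consider that a text file could be malformed? Is there some part of your application that would need to handle bad input? If the application rejects it, can the bad input be fixed at the source? In other words, MalformedInputException is the expected behavior in many scenarios. –

@Kayaman Thanks, I didn't notice that. My platform default is UTF-8, though. I changed the code to specify the Charset and the behavior stays the same. (Editing the code here) –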

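The difference in behavior actually goes all the way down to the StreamDecoder and Charset classes. The InputStreamReader gets its decoder from StreamDecoder.forInputStreamReader(..), which creates a CharsetDecoder that does replacement on error: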

StreamDecoder(InputStream in, Object lock, Charset cs) { 
    this(in, lock, 
    cs.newDecoder() 
    .onMalformedInput(CodingErrorAction.REPLACE) 
    .onUnmappableCharacter(CodingErrorAction.REPLACE)); 
}

while Channels.newReader(..) creates the decoder with the default settings (i.e. report instead of replace, which results in an exception further up):

public static Reader newReader(ReadableByteChannel ch, 
           String csName) { 
    checkNotNull(csName, "csName"); 
    return newReader(ch, Charset.forName(csName).newDecoder(), -1); 
}

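So they work differently, but the difference is not mentioned anywhere in the documentation. This is badly documented, but I assume they changed the functionality because you would rather get an exception than have your data silently corrupted.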
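
If you want one behavior or the other regardless of which factory you use, both InputStreamReader and Channels.newReader have overloads that accept a CharsetDecoder you have configured yourself. The following is a rough sketch rather than a recommended implementation; the class name, helper names, the in-memory input, and the "<malformed>" marker string are all made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class (not from the original post).
final class ExplicitDecoderDemo {

    // Builds a UTF-8 decoder with the same action for both kinds of coding error.
    static CharsetDecoder utf8Decoder(CodingErrorAction action) {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
    }

    // java.io reader that reports errors (the opposite of its default behavior).
    static Reader strictStreamReader(byte[] data) {
        return new InputStreamReader(new ByteArrayInputStream(data),
                utf8Decoder(CodingErrorAction.REPORT));
    }

    // Channel-based reader that replaces errors (the opposite of its default behavior).
    static Reader lenientChannelReader(byte[] data) {
        return Channels.newReader(Channels.newChannel(new ByteArrayInputStream(data)),
                utf8Decoder(CodingErrorAction.REPLACE), -1);
    }

    // Reads everything; returns the marker "<malformed>" if the decoder reported an error.
    static String drain(Reader reader) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = reader) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (MalformedInputException e) {
            return "<malformed>";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bad = { (byte) 0xC4 };
        System.out.println(drain(strictStreamReader(bad)));
        System.out.println(drain(lenientChannelReader(bad)));
    }
}
```

Passing the decoder explicitly makes the error handling visible at the call site instead of depending on which reader factory happens to be used.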

Be careful when dealing with character encodings!

Source

2017-08-11 07:38:07 Kayaman
