2017-08-24

I found other people with the same problem, and theirs was solved by specifying UTF-8 in the InputStreamReader constructor: How do I read Unicode correctly from an InputStream?

Reading InputStream as UTF-8

https://www.mkyong.com/java/how-to-read-utf-8-encoded-data-from-a-file-java/

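This is not working for me, and I don't know why. No matter what I try, I get escaped Unicode values (backslash-u plus hex digits) instead of the actual language characters. What am I doing wrong here? Thanks in advance!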

// InputStream is is a FileInputStream: 
public void load(InputStream is) throws Exception { 

    BufferedReader br = null; 

    try { 
     // Passing "UTF8" or "UTF-8" to this constructor makes no difference for me: 
     br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8)); 
     String line = null;   
     while ((line = br.readLine()) != null) { 
      // The following prints "got line: chinese = \u4f60\u597d" instead of "got line: chinese = 你好" 
      System.out.println("got line: " + line); 
     } 
    } finally { 
     if (br != null) { 
      br.close(); 
     } 
    }  
} 

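Please note: this is not a font problem. I know this because if I use ResourceBundle on the same file, the Chinese characters print correctly in the IDE console. But whenever I try to read the file manually with a FileInputStream, the characters come out in the backslash-u convention, even though I tell it to use UTF-8 encoding. I also tried changing the project's encoding and the JVM arguments, but still no joy. Thanks again for any suggestions.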

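Also, using ResourceBundle as the final solution is not an option for me. There are legitimate reasons why it is not right for the job in this particular project and why I need to do this explicitly myself.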

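Edit: I tried pulling the bytes from the InputStream manually, bypassing InputStreamReader and its constructor entirely, since it seemed to be ignoring my encoding argument. That only produced the same behavior: the backslash-u convention instead of the proper characters. I have a hard time understanding why this does not work for me the way it does for everyone else. Is it possible that some system/OS setting somewhere overrides Java's ability to handle Unicode correctly? I am using Java version 1.8.0_65 (64-bit) on Windows 7 version 6.1 (also 64-bit).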

public void load(InputStream is) throws Exception {  
    String line = null;  
    try { 
     while ((line = readLine(is)) != null) { 
      // The following prints "got line: chinese = \u4f60\u597d" instead of "got line: chinese = 你好" 
      System.out.println("got line: " + line);     
     }   
    } finally { 
     is.close(); 
    }  
} 

private String readLine(InputStream is) throws Exception {  
    List<Byte> bytesList = new ArrayList<>();  
    while (true) { 
     byte b = -1; 

     try { 
      b = (byte)is.read(); 
     } catch (EOFException e) { 
      return bytesToString(bytesList); 
     }   
     if (b == -1) { 
      return bytesToString(bytesList); 
     } 
     char ch = (char)b; 
     if (ch == '\n') { 
      return bytesToString(bytesList); 
     } 
     bytesList.add(b); 
    }  
} 

private String bytesToString(List<Byte> bytesList) {   
    if (bytesList.isEmpty()) { 
     return null; 
    }  
    byte[] bytes = new byte[bytesList.size()]; 
    for (int i = 0; i < bytes.length; i++) { 
     bytes[i] = bytesList.get(i); 
    }  
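    // Decode explicitly as UTF-8; the no-charset constructor falls back to the platform default.
    return new String(bytes, StandardCharsets.UTF_8); 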
} 
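
A quick way to narrow this down, assuming the file on disk holds the escapes as plain ASCII text (the way .properties files processed by native2ascii do), is to decode the same bytes with two different charsets and compare the results. This is only a sketch: `EscapeCheck` and `hasLiteralEscape` are hypothetical names, and the byte array stands in for the real file content.

```java
import java.nio.charset.StandardCharsets;

public class EscapeCheck {
    // Hypothetical helper: true if the text contains a literal backslash-u escape
    // (a backslash, the letter u, and four hex digits).
    public static boolean hasLiteralEscape(String text) {
        return text.matches("(?s).*\\\\u[0-9a-fA-F]{4}.*");
    }

    public static void main(String[] args) {
        // Assumed file content: twelve ASCII characters, not two Chinese characters.
        byte[] raw = "\\u4f60\\u597d".getBytes(StandardCharsets.US_ASCII);

        String asUtf8 = new String(raw, StandardCharsets.UTF_8);
        String asLatin1 = new String(raw, StandardCharsets.ISO_8859_1);

        System.out.println("byte count: " + raw.length);
        System.out.println("same text under both charsets: " + asUtf8.equals(asLatin1));
        System.out.println("contains literal escape: " + hasLiteralEscape(asUtf8));
    }
}
```

If both decodings come out equal and the escape pattern matches, that suggests the bytes are pure ASCII, in which case no charset argument can turn them into Chinese characters and the escapes have to be decoded as a separate step.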

Answer


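In case anyone else runs into the same trouble, I found a solution. Since ResourceBundle always did the right thing for me, I dug into why, and found that java.util.Properties does all the magic in its loadConvert() function. In other words, the file itself contains the literal ASCII text \u4f60\u597d, so the charset passed to InputStreamReader was never the issue. After the BufferedReader hands me a line of text from the file, I need to explicitly decode the Unicode escapes in that string, along these lines: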

public void load(InputStream is) throws Exception { 

    BufferedReader br = null; 

    try { 
     // Passing "UTF8" or "UTF-8" to this constructor makes no difference for me: 
     br = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8)); 
     String line = null;   
     while ((line = br.readLine()) != null) { 
      // The following prints "got line: chinese = \u4f60\u597d" instead of "got line: chinese = 你好" 
      System.out.println("got line: " + line); 
      line = decodeUni(line); 
      // The following prints "decoded line: chinese = 你好" exactly as it should! 
      System.out.println("decoded line: " + line); 
     } 
    } finally { 
     if (br != null) { 
      br.close(); 
     } 
    }  
} 

// Converts encoded "\\uxxxx" to unicode chars 
private String decodeUni(String string) { 

    char[] charsIn = string.toCharArray(); 
    int len = charsIn.length; 
    char[] charsOut = new char[len]; 
    char ch; 
    int outLen = 0; 
    int off = 0; 
    int end = off + len; 

    while (off < end) { 
     ch = charsIn[off++]; 
     // Does this start an escape? (A trailing backslash with nothing after it is kept as-is.)
     if (ch == '\\' && off < end) { 
      ch = charsIn[off++]; 
      if(ch == 'u') { 
       // Yep! Convert the hex part to the correct character. 
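       // Guard against a truncated escape so we fail with a clear message
       // instead of an ArrayIndexOutOfBoundsException.
       if (end - off < 4) {
        throw new IllegalArgumentException("Malformed \\uxxxx encoding: " + string);
       }
       int value = 0; 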
       for (int i = 0; i < 4; i++) { 
        ch = charsIn[off++]; 
        switch (ch) { 
         case '0': case '1': case '2': case '3': case '4': 
         case '5': case '6': case '7': case '8': case '9': { 
          value = (value << 4) + ch - '0'; 
          break; 
         } 
         case 'a': case 'b': case 'c': case 'd': case 'e': case 'f': { 
          value = (value << 4) + 10 + ch - 'a'; 
          break; 
         } 
         case 'A': case 'B': case 'C': case 'D': case 'E': case 'F': { 
          value = (value << 4) + 10 + ch - 'A'; 
          break; 
         } 
         default: throw new IllegalArgumentException("Malformed \\uxxxx encoding: " + string); 
        } 
       } 
       charsOut[outLen++] = (char)value; 
      } else { 
       // Starts with a slash but not "\\u", handle the other possible escaped characters. 
       switch (ch) { 
        case 't': 
         ch = '\t'; 
         break; 
        case 'r': 
         ch = '\r'; 
         break; 
        case 'n': 
         ch = '\n'; 
         break; 
        case 'f': 
         ch = '\f'; 
         break; 
        default: 
         break; 
       } 
       charsOut[outLen++] = ch; 
      } 
     } else { 
      // Doesn't start with a slash, leave as-is. 
      charsOut[outLen++] = ch; 
     } 
    } 
    return new String(charsOut, 0, outLen).trim(); 
}
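
If the lines are in key=value form and using java.util.Properties directly is acceptable (an assumption; the question only rules out ResourceBundle), the same decoding can be had without a hand-written loop, because `Properties.load(Reader)` resolves backslash-u escapes while parsing. A minimal sketch, with `PropertiesEscapeDemo` and `readValue` as hypothetical names and a StringReader standing in for the real file:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class PropertiesEscapeDemo {
    // Hypothetical helper: parse key=value text and return the decoded value for one key.
    public static String readValue(String text, String key) {
        Properties props = new Properties();
        try (Reader reader = new StringReader(text)) {
            // load(Reader) decodes backslash-u escapes as part of parsing.
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // Assumed file content: the escapes are literal ASCII text.
        String fileContent = "chinese = \\u4f60\\u597d\n";
        String value = readValue(fileContent, "chinese");

        System.out.println("decoded length: " + value.length());
        System.out.println("first char is U+4F60: " + (value.charAt(0) == 0x4F60));
        System.out.println("second char is U+597D: " + (value.charAt(1) == 0x597D));
    }
}
```

For a real file, the StringReader would be replaced by an InputStreamReader over the FileInputStream. Note that Properties also splits each line into key and value and drops comment lines, so this only fits when that parsing is wanted; otherwise the manual decodeUni approach above keeps each line intact.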