Changing encoding in Java

I'm writing a function that should detect the charset in use and then switch it to UTF-8. I'm using juniversalchardet, which is a Java port of Mozilla's universalchardet.
Here is my code:

private List<List<String>> setProperEncoding(List<List<String>> input) { 
    try { 

     // Detect used charset 
     UniversalDetector detector = new UniversalDetector(null); 

     int position = 0; 
     while ((position < input.size()) & (!detector.isDone())) { 
      String row = null; 
      for (String cell : input.get(position)) { 
       row += cell; 
      } 
      byte[] bytes = row.getBytes(); 
      detector.handleData(bytes, 0, bytes.length); 
      position++; 
     } 
     detector.dataEnd(); 

     Charset charset = Charset.forName(detector.getDetectedCharset()); 
     Charset utf8 = Charset.forName("UTF-8"); 
     System.out.println("Detected charset: " + charset); 

     // rewrite input using proper charset 
     List<List<String>> newLines = new ArrayList<List<String>>(); 
     for (List<String> row : input) { 
      List<String> newRow = new ArrayList<String>(); 
      for (String cell : row) { 
       //newRow.add(new String(cell.getBytes(charset))); 
       ByteBuffer bb = ByteBuffer.wrap(cell.getBytes(charset)); 
       CharBuffer cb = charset.decode(bb); 
       bb = utf8.encode(cb); 
       newRow.add(new String(bb.array())); 
      } 
      newLines.add(newRow); 
     } 

     return newLines; 

    } catch (Exception e) { 
     e.printStackTrace(); 
     return input; 
    } 
}

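My problem is that when I read a file containing, for example, Polish characters, letters such as ł, ą, ć and similar ones are replaced by ? and other strange things. What am I doing wrong?

EDIT: I compile with Eclipse.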

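The method parameter is the result of reading a MultipartFile. I simply use a FileInputStream to get every line and then split each line by some separator (it is prepared for xls, xlsx and csv files). Nothing special.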

2013-07-16 Pierwola

How do you compile your code? Eclipse? Command prompt? Ant? Maven? – VirtualTroll

Once you have the characters in a `String`, they are already characters, not bytes. – gaborsch
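To make the comment above concrete, here is a small sketch (the class name `DecodeDemo` and the sample word are made up for illustration). It shows that encoding and immediately decoding an existing String with the same charset changes nothing, while the charset chosen when the raw bytes are first decoded determines whether Polish letters survive. It assumes the JRE provides the windows-1250 charset, which is usual for full JDK builds but not guaranteed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // Encode and immediately decode with the same charset: for characters the
    // charset can represent this returns an equal String, so it repairs nothing.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    // Decoding is the step where the charset matters: bytes in, characters out.
    static String decode(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        Charset cp1250 = Charset.forName("windows-1250"); // assumed to be available
        String original = "\u017C\u00F3\u0142\u0107";     // Polish letters, written as escapes
        byte[] raw = original.getBytes(cp1250);            // what a Polish file might contain

        System.out.println("right charset:  " + decode(raw, cp1250).equals(original));
        System.out.println("wrong charset:  " + decode(raw, StandardCharsets.ISO_8859_1).equals(original));
        System.out.println("String no-op:   " + roundTrip(original, cp1250).equals(original));
    }
}
```

In other words, the charset has to be applied where the bytes are turned into characters; once a wrong decoding has produced `?` characters, the original bytes can no longer be recovered from the String.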

What is the source of `input`? Please show your code for that. – gaborsch

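First of all, your data exists somewhere in binary form. For simplicity, I assume it comes from an InputStream.

You want to write the output as a UTF-8 string; I assume it can go to an OutputStream.

I would suggest creating an AutoDetectInputStream: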

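import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import org.mozilla.universalchardet.UniversalDetector;

public class AutoDetectInputStream extends InputStream {
    private final InputStream is;
    private final byte[] sampleData = new byte[4096];
    private final int sampleLen;
    private int sampleIndex = 0;

    public AutoDetectInputStream(InputStream is) throws IOException {
        this.is = is;
        // pre-read a sample of the data; read() returns -1 for an empty stream
        sampleLen = Math.max(is.read(sampleData), 0);
    }

    public Charset getCharset() {
        // detect the charset from the pre-read sample
        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(sampleData, 0, sampleLen);
        detector.dataEnd();
        // getDetectedCharset() returns the charset name, or null if nothing was detected
        String name = detector.getDetectedCharset();
        // falling back to UTF-8 here is an assumption; pick whatever default suits your data
        return Charset.forName(name != null ? name : "UTF-8");
    }

    @Override
    public int read() throws IOException {
        // replay the sampled bytes first, then continue with the underlying stream
        if (sampleIndex < sampleLen) {
            // mask to 0..255 so that bytes >= 0x80 are not returned as negative values
            return sampleData[sampleIndex++] & 0xFF;
        }
        return is.read();
    }

    @Override
    public void close() throws IOException {
        is.close();
    }
}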
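One detail worth checking in any hand-written `read()` such as the one above: the `InputStream.read()` contract is to return the next byte as an `int` in the range 0 to 255, or -1 at end of stream. Java's `byte` is signed, so a byte of 0x80 or above has to be masked with `& 0xFF` before it is returned; an unmasked 0xFF in particular becomes -1 and looks like end of stream. The sketch below isolates that conversion (the `ByteMaskDemo` class and its method names are illustrative, not part of the answer).

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

class ByteMaskDemo {
    // Widening a signed byte directly: values >= 0x80 come out negative.
    static int unmasked(byte b) {
        return b;
    }

    // Masking keeps only the low 8 bits, giving the 0..255 range read() must return.
    static int masked(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) throws Exception {
        byte b = (byte) 0xFF;
        System.out.println("unmasked: " + unmasked(b));
        System.out.println("masked:   " + masked(b));

        // The JDK's own streams follow the same contract.
        InputStream in = new ByteArrayInputStream(new byte[] { b });
        System.out.println("ByteArrayInputStream.read(): " + in.read());
    }
}
```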

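The second task is simple: a Java String already holds decoded characters (stored internally as UTF-16), so writing it out as UTF-8 only takes an OutputStreamWriter created with the UTF-8 charset. So, here is your code: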

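import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// open input with the detector stream
// we use BufferedReader so we can read lines
InputStream is = new FileInputStream("in.txt");
AutoDetectInputStream detector = new AutoDetectInputStream(is);
Charset charset = detector.getCharset();
// here we can use the charset to decode the bytes into characters
BufferedReader rdr = new BufferedReader(new InputStreamReader(detector, charset));

// open output to write to
OutputStream os = new FileOutputStream("out.txt");
Writer utf8Writer = new OutputStreamWriter(os, StandardCharsets.UTF_8);

// copy the whole file; readLine() strips the line terminator, so write one back
String line;
while ((line = rdr.readLine()) != null) {
    utf8Writer.append(line).append('\n');
}

// close streams (closing the writer also flushes it)
rdr.close();
utf8Writer.close();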

So, in the end, you get the whole txt file transcoded to UTF-8.
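If the source charset is already known (or has been detected beforehand), the same transcoding can be sketched with try-with-resources and `java.nio.file.Files`, which closes both files even when an exception is thrown. The `TranscodeDemo` class, its method names and the sample text are hypothetical, and windows-1250 is only an assumed source charset. The copy works on blocks of characters rather than lines, so line terminators are carried over unchanged.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TranscodeDemo {
    // Decode `in` using `source` and re-encode the characters into `out` as UTF-8.
    static void transcode(Path in, Charset source, Path out) {
        try (Reader reader = Files.newBufferedReader(in, source);
             Writer writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                writer.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: write `text` to a temp file in `source`, transcode it,
    // and read the UTF-8 result back as a String.
    static String viaTempFiles(String text, Charset source) {
        try {
            Path in = Files.createTempFile("in", ".txt");
            Path out = Files.createTempFile("out", ".txt");
            Files.write(in, text.getBytes(source));
            transcode(in, source, out);
            return new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Charset cp1250 = Charset.forName("windows-1250"); // assumed source charset
        System.out.println(viaTempFiles("\u017C\u00F3\u0142\u0107\r\nline 2\n", cp1250));
    }
}
```

One difference from `InputStreamReader`: the reader returned by `Files.newBufferedReader` throws a `CharacterCodingException` when it meets bytes that are invalid in the given charset, instead of silently substituting replacement characters. That makes a wrong charset guess visible early, at the cost of failing on slightly damaged input.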

Note that the sample buffer should be large enough to give the UniversalDetector sufficient input.
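Two things limit how much data the detector sees in the class above: the fixed 4096-byte array, and the fact that a single `InputStream.read(byte[])` call may return fewer bytes than requested. The sketch below is one possible way around both, using only the JDK: it reads a sample of a caller-chosen size with `BufferedInputStream.mark`/`reset`, so the stream can afterwards be consumed from its first byte without a custom `read()`. The `SampleReader` class and `peek` method are hypothetical names, and the juniversalchardet call is deliberately left out.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SampleReader {
    // Read up to sampleSize bytes for charset detection, then rewind the stream.
    static byte[] peek(BufferedInputStream in, int sampleSize) {
        try {
            in.mark(sampleSize); // remember the current position
            byte[] sample = new byte[sampleSize];
            int total = 0;
            // A single read() may return fewer bytes than requested, so loop.
            while (total < sampleSize) {
                int n = in.read(sample, total, sampleSize - total);
                if (n == -1) {
                    break; // end of stream before the sample was full
                }
                total += n;
            }
            in.reset(); // subsequent reads start again at the marked position
            return Arrays.copyOf(sample, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello, world".getBytes(StandardCharsets.US_ASCII);
        BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        byte[] sample = peek(in, 5); // these bytes would be handed to the detector
        System.out.println(sample.length + " bytes sampled, next byte after reset: " + in.read());
    }
}
```

The sample returned by `peek` would be passed to `detector.handleData(sample, 0, sample.length)`, and the same `BufferedInputStream` can then be wrapped in an `InputStreamReader` with the detected charset.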

2013-07-16 16:41:18 gaborsch

Works perfectly! Thank you! You are the best! More than that, you are the very best! – Pierwola

@Pierwola :D :D Thanks, I'm always happy to see that I could help someone, and that they appreciate it too :) – gaborsch

It works, but my text is converted to 「ћонгол」лсын≈р？нхийл？гч「улгарт？ рийн2223жил「. Most of the letters are right, some are wrong. The language is Mongolian. Any reply is welcome :D – Enxtur
