Download XML, remove the BOM and encode as UTF-8

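I am downloading XML files from an FTP server and have to prepare them for my SAX parser. To do that, I need to remove the BOM bytes and encode the content as UTF-8. But somehow it does not work for every file.

Here is the code of my functions: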

public static void copy(File src, File dest) {
    try {
        byte[] data = Files.readAllBytes(src.toPath());

        writeAsUTF8(dest, skipBom(data));
    } catch (IOException e) {
        e.printStackTrace();
    }
}

private static void writeAsUTF8(File out, byte[] data) {
    try {
        FileOutputStream outStream = new FileOutputStream(out);
        OutputStreamWriter outUTF = new OutputStreamWriter(outStream, "UTF8");

        outUTF.write(new String(data, "UTF8"));
        //outUTF.write(new String(data));
        outUTF.flush();
        outStream.close();
        outUTF.close();
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}

private static byte[] skipBom(byte[] data) {
    int skipBytes = getBomSize(data);

    byte[] tmp = new byte[data.length - skipBytes];

    for (int x = 0; x < tmp.length; x++) {
        tmp[x] = data[x + skipBytes];
    }

    return tmp;
}

Any idea what I am doing wrong?


2014-01-27 Adam Sam

Have you tried any of the ideas from [this question](http://stackoverflow.com/questions/1835430/byte-order-mark-screws-up-file-reading-in-java/)? – andyb

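Why do you want to remove the BOM bytes? You only need to read the file into a String using the file's own encoding, and then write that String to a file using UTF-8.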
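A minimal sketch of that idea, assuming the source charset is already known. The class and method names (`Transcoder`, `toUtf8`, `copyAsUtf8`) are hypothetical and not part of the question's code. Note that the question's `writeAsUTF8` always decodes with "UTF8", so any file stored in another encoding would be decoded incorrectly before it is written.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: names are illustrative only.
final class Transcoder {

    /** Decodes the bytes with the given source charset and re-encodes them as UTF-8. */
    static byte[] toUtf8(byte[] data, Charset sourceCharset) {
        String text = new String(data, sourceCharset);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Reads src completely, transcodes it and writes the result to dest. */
    static void copyAsUtf8(Path src, Path dest, Charset sourceCharset) throws IOException {
        Files.write(dest, toUtf8(Files.readAllBytes(src), sourceCharset));
    }
}
```

For example, `Transcoder.copyAsUtf8(src, dest, StandardCharsets.ISO_8859_1)` would convert a Latin-1 file. This sketch does not touch a leading BOM and does not work out the source charset for you.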


2014-01-27 14:29:18 fatih

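I wouldn't, but then I get an exception when reading the file with the SAX parser (the symbol in line 1 is invalid, or something like that)

What are you feeding the SAX parser with? If you provide an input source that wraps a reader (which knows the bytes must be read as UTF-8), then everything should be fine. Or did I misunderstand something? – fatih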

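@faith: No, that does not always work. If the first bytes in the input stream are a BOM, SAX complains about illegal bytes and throws an exception. You need to get rid of those leading bytes before handing the data to SAX. – alexraasch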

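I cannot see what is wrong with your code. I had the same problem a while ago and used the code below. First, the following function reads the file and skips the first character (the BOM). Of course, this only makes sense if you are sure that every file has a BOM.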

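import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public byte[] load(File inputFile, int lines) throws Exception {
    // Declared outside the try block so it is still in scope for the return
    StringBuilder builder = new StringBuilder();

    try (BufferedReader reader
            = new BufferedReader(
                new InputStreamReader(
                    new FileInputStream(inputFile), StandardCharsets.UTF_8)))
    {
        // Discard the Byte Order Mark: decoded as UTF-8 it is the single char U+FEFF
        reader.read();

        String line;
        int lineCount = 0;

        while (lineCount <= lines && (line = reader.readLine()) != null) {
            lineCount += 1;
            builder.append(line).append("\n");
        }
    }

    // Encode explicitly as UTF-8 instead of relying on the platform default
    return builder.toString().getBytes(StandardCharsets.UTF_8);
}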
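If not every file is guaranteed to start with a BOM, unconditionally dropping the first character would eat the leading `<` of a BOM-less file. The following is a sketch of a conditional variant, not the answerer's code: it assumes UTF-8 input, and the names `BomAwareReader` and `readWithoutBom` are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: names are illustrative only.
final class BomAwareReader {

    /** Reads the whole stream as UTF-8 and drops a leading U+FEFF only if one is present. */
    static String readWithoutBom(InputStream in) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            StringBuilder builder = new StringBuilder();

            // Java's UTF-8 decoder does not strip the BOM; it arrives as the char U+FEFF
            int first = reader.read();
            if (first != -1 && first != 0xFEFF) {
                builder.append((char) first); // not a BOM, so keep it
            }

            char[] buffer = new char[8192];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                builder.append(buffer, 0, n);
            }
            return builder.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Unlike the function above, this sketch reads the entire stream rather than a limited number of lines.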

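You can rewrite the function above so that it writes the data back to another file as UTF-8. I occasionally use the following method to convert a file on disk from ISO-8859-1 to UTF-8: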

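import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

public static void convertToUTF8(Path p) throws Exception {
    Path docPath = p;
    Path docPathUTF8 = docPath; // same path: the file is converted in place

    // Buffer for up to 100 million chars; a larger file overflows it
    CharBuffer cb = CharBuffer.allocate(100 * 1000 * 1000);

    // Read the whole file as ISO-8859-1 before the output stream truncates it
    try (InputStreamReader in = new InputStreamReader(
            new FileInputStream(docPath.toFile()), StandardCharsets.ISO_8859_1)) {
        int c;
        while ((c = in.read()) != -1) {
            cb.put((char) c);
        }
    }

    char[] x = new char[cb.position()];
    System.arraycopy(cb.array(), 0, x, 0, x.length);

    // Write the same characters back, now encoded as UTF-8
    try (OutputStreamWriter out = new OutputStreamWriter(
            new FileOutputStream(docPathUTF8.toFile()), StandardCharsets.UTF_8)) {
        out.write(x);
    }
}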


2014-01-27 14:41:39 alexraasch

Simplified:

writeAsUTF8(dest, data); 



try {
    // Byte length of the UTF-8 encoded BOM (U+FEFF), i.e. EF BB BF
    int bomLength = "\uFEFF".getBytes(StandardCharsets.UTF_8).length;
    if (data.length < bomLength
            || !new String(data, 0, bomLength, StandardCharsets.UTF_8).equals("\uFEFF")) {
        bomLength = 0; // no BOM present, write everything
    }
    FileOutputStream outStream = new FileOutputStream(out);
    outStream.write(data, bomLength, data.length - bomLength);
    outStream.close();
}
catch (Exception ex) {
    ex.printStackTrace();
}

This checks whether the BOM (U+FEFF) is present. Simply reading everything into a String would be simpler:

String xml = new String(data, StandardCharsets.UTF_8);
xml = xml.replaceFirst("^\uFEFF", ""); // strip a leading BOM, if any

Using a Charset instead of a String encoding name means one less exception to catch: UnsupportedEncodingException (an IOException).
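To illustrate the difference, here is a small sketch of both `String` constructor overloads side by side. The class and method names are invented for this example.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: names are illustrative only.
final class CharsetVsName {

    /** Charset overload: no checked exception to handle. */
    static String decodeWithCharset(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    /** String-name overload: the compiler forces a try/catch or a throws clause. */
    static String decodeWithName(byte[] data) {
        try {
            return new String(data, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Cannot happen for UTF-8, which every Java platform must support
            throw new IllegalStateException(e);
        }
    }
}
```

Both methods return the same result; the first is simply shorter and cannot fail on a misspelled charset name at runtime.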

Detecting the XML encoding:

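// Decode as ISO-8859-1 first: every byte maps to a char, so the declaration can be searched
String xml = new String(data, StandardCharsets.ISO_8859_1);
String encoding = xml.replaceFirst(
        "(?s)^.*<\\?xml.*encoding=([\"'])([\\w-]+)\\1.*\\?>.*$",
        "$2");

// replaceFirst returns the input unchanged when the pattern does not match
if (encoding.equals(xml)) {
    encoding = "UTF-8";
}
xml = new String(data, encoding);
xml = xml.replaceFirst("^\uFEFF", ""); // strip a leading BOM, if any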
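The same detection can be sketched as a self-contained helper that only inspects the start of the file and falls back to a caller-supplied charset. This is an illustration under stated assumptions, not the answerer's code: the names `XmlEncodingSniffer` and `declaredCharset` are hypothetical, the 200-byte window is an arbitrary choice, and the approach only works for ASCII-compatible encodings (a UTF-16 file would not match).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names are illustrative only.
final class XmlEncodingSniffer {

    // Matches encoding="..." or encoding='...' inside an XML declaration
    private static final Pattern DECLARATION = Pattern.compile(
            "<\\?xml[^>]*?encoding\\s*=\\s*([\"'])([A-Za-z][A-Za-z0-9._-]*)\\1");

    /** Returns the charset named in the XML declaration, or the fallback if none is usable. */
    static Charset declaredCharset(byte[] data, Charset fallback) {
        // Only the beginning of the document can contain the declaration
        int length = Math.min(data.length, 200);
        String head = new String(data, 0, length, StandardCharsets.ISO_8859_1);

        Matcher m = DECLARATION.matcher(head);
        if (m.find() && Charset.isSupported(m.group(2))) {
            return Charset.forName(m.group(2));
        }
        return fallback;
    }
}
```

A caller could then decode with `new String(data, XmlEncodingSniffer.declaredCharset(data, StandardCharsets.UTF_8))`.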


2014-01-27 14:44:15

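The BOM is not the problem; removing it always works. The main problem is the encoding: I read the bytes with .readAllBytes() and then try to save them as UTF-8. The source file can have any encoding, but in the end it must be UTF-8.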
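One way to handle input in more than one encoding is to let the BOM, when there is one, decide which charset to decode with, and only otherwise use a fallback. The sketch below assumes that only UTF-8 and UTF-16 BOMs need to be recognised (UTF-32 is deliberately left out); the names `BomTranscoder`, `charsetFromBom` and `toUtf8WithoutBom` are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: names are illustrative only.
final class BomTranscoder {

    /** Returns the charset indicated by a leading BOM, or null if no known BOM is found. */
    static Charset charsetFromBom(byte[] d) {
        if (d.length >= 3 && (d[0] & 0xFF) == 0xEF && (d[1] & 0xFF) == 0xBB && (d[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;    // EF BB BF
        }
        if (d.length >= 2 && (d[0] & 0xFF) == 0xFE && (d[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE; // FE FF
        }
        if (d.length >= 2 && (d[0] & 0xFF) == 0xFF && (d[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE; // FF FE
        }
        return null;
    }

    /** Decodes with the BOM's charset (or the fallback), drops the BOM, re-encodes as UTF-8. */
    static byte[] toUtf8WithoutBom(byte[] data, Charset fallback) {
        Charset fromBom = charsetFromBom(data);
        int skip = 0;
        if (fromBom != null) {
            skip = fromBom.equals(StandardCharsets.UTF_8) ? 3 : 2; // BOM size in bytes
        }
        Charset charset = (fromBom != null) ? fromBom : fallback;
        String text = new String(data, skip, data.length - skip, charset);
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

One caveat: the output still carries whatever `encoding="..."` the original XML declaration named. If the result is handed to a parser as a byte stream, that declaration may need to be updated as well.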

Added the use of the encoding declared in the XML.

This "(?s)^.*<\\?xml.*encoding=(\"'])(\w+)\\1.*\\?>.*$", "$2"); does not work: there is a stray '"' in the encoding part, a backslash is missing, and the '-' was forgotten.
