Non-ASCII characters in a URL in Java

I have been using the java.net.URI class to decode URLs, but it does not always work correctly.

String test = "https://fr.wikipedia.org/wiki/Fondation_Alliance_fran%C3%A7aise"; 
URI uri = new URI(test); 
System.out.println(uri.getPath());

For the test string "https://fr.wikipedia.org/wiki/Fondation_Alliance_fran%C3%A7aise", the result is correct: "/wiki/Fondation_Alliance_française" (%C3%A7 is correctly replaced by ç).

But for some other test strings, such as "http://sv.wikipedia.org/wiki/Anv%E4ndare:Lsjbot/Statistik#Drosophilidae", it gives an incorrect result: "/wiki/Anv�ndare:Lsjbot/Statistik" (%E4 is replaced by � instead of ä).
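The � in that output matches the documented behaviour of java.net.URI: getPath() decodes escaped octets as UTF-8 and substitutes U+FFFD for malformed input. A minimal sketch that makes this visible (the class and method names are made up for illustration):

```java
import java.net.URI;

// Illustrative only: UriPathDemo and decodedPath are not from the original post.
final class UriPathDemo {

    // URI.create wraps parse errors in IllegalArgumentException;
    // getPath() returns the path with escaped octets decoded as UTF-8.
    static String decodedPath(String url) {
        return URI.create(url).getPath();
    }

    public static void main(String[] args) {
        String ok = decodedPath(
                "https://fr.wikipedia.org/wiki/Fondation_Alliance_fran%C3%A7aise");
        String bad = decodedPath(
                "http://sv.wikipedia.org/wiki/Anv%E4ndare:Lsjbot/Statistik#Drosophilidae");

        System.out.println(ok);
        // %E4 on its own is not valid UTF-8, so the decoded path carries U+FFFD.
        System.out.println("contains U+FFFD: " + (bad.indexOf('\uFFFD') >= 0));
    }
}
```

getRawPath() still returns the untouched "%E4", which is why working from the raw path is the usual starting point for custom decoding.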

I ran some tests with getRawPath() and the URLDecoder class.

System.out.println(URLDecoder.decode(uri.getRawPath(), "UTF8")); 
System.out.println(URLDecoder.decode(uri.getRawPath(), "ISO-8859-1")); 
System.out.println(URLDecoder.decode(uri.getRawPath(), "WINDOWS-1252"));

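Depending on the test string, I get the correct result with a different encoding:

For %C3%A7, I get the correct result with the "UTF-8" encoding, as expected, and an incorrect result with the "ISO-8859-1" or "WINDOWS-1252" encoding.
For %E4, it is the opposite.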
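To make the comparison above reproducible, here is a small sketch that decodes just the two escaped fragments under each charset. The class name and the decode helper are assumptions added for illustration; only URLDecoder.decode(String, String) comes from the JDK.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Illustrative only: CharsetComparison and decode are hypothetical names.
final class CharsetComparison {

    // Thin wrapper that turns the checked exception into an unchecked one.
    static String decode(String raw, String charsetName) {
        try {
            return URLDecoder.decode(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String[] inputs = {"fran%C3%A7aise", "Anv%E4ndare"};
        String[] charsets = {"UTF-8", "ISO-8859-1", "windows-1252"};
        for (String input : inputs) {
            for (String charset : charsets) {
                // Print every combination so the mismatches are easy to spot.
                System.out.println(input + " as " + charset + " -> "
                        + decode(input, charset));
            }
        }
    }
}
```

The bytes C3 A7 are one character (ç) in UTF-8 but two characters (Ã§) in the single-byte charsets, while the lone byte E4 is ä in the single-byte charsets and malformed in UTF-8.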

For both test URLs, if I paste them into the Chrome address bar, I get the correct page.

How can I decode URLs correctly in all cases? Thanks for your help.

==== ANSWER ====

Thanks to McDowell's suggestion in the answer below, it now seems to work. Here is my current code:

private static void appendBytes(ByteArrayOutputStream buf, String data) throws UnsupportedEncodingException { 
    byte[] b = data.getBytes("UTF8"); 
    buf.write(b, 0, b.length); 
} 

private static byte[] parseEncodedString(String segment) throws UnsupportedEncodingException { 
    ByteArrayOutputStream buf = new ByteArrayOutputStream(segment.length()); 
    int last = 0; 
    int index = 0; 
    while (index < segment.length()) {
        if (segment.charAt(index) == '%') {
            // Copy the literal text that precedes this '%'.
            appendBytes(buf, segment.substring(last, index));
            if ((index + 2 < segment.length()) &&
                    ("0123456789ABCDEFabcdef".indexOf(segment.charAt(index + 1)) >= 0) &&
                    ("0123456789ABCDEFabcdef".indexOf(segment.charAt(index + 2)) >= 0)) {
                // "%XX": emit the raw byte value.
                buf.write((byte) Integer.parseInt(segment.substring(index + 1, index + 3), 16));
                index += 3;
            } else if ((index + 1 < segment.length()) &&
                    (segment.charAt(index + 1) == '%')) {
                // "%%": treat as a single literal '%'.
                buf.write((byte) '%');
                index += 2;
            } else {
                // Stray '%': keep it as-is.
                buf.write((byte) '%');
                index++;
            }
            last = index;
        } else {
            index++;
        }
    }
    appendBytes(buf, segment.substring(last)); 
    return buf.toByteArray(); 
} 

private static String parseEncodedString(String segment, Charset... encodings) { 
    if ((segment == null) || (segment.indexOf('%') < 0)) {
        return segment;
    }
    try {
        byte[] data = parseEncodedString(segment);
        for (Charset encoding : encodings) {
            try {
                if (encoding != null) {
                    return encoding.newDecoder().
                            onMalformedInput(CodingErrorAction.REPORT).
                            decode(ByteBuffer.wrap(data)).toString();
                }
            } catch (CharacterCodingException e) {
                // Incorrect encoding, try next one
            }
        }
    } catch (UnsupportedEncodingException e) {
        // Nothing to do
    }
    return segment;
}

2014-02-20 NicoV

Note that URLDecoder is not suitable for decoding URI paths; it works in most cases, but not all. – fge

I know. I only tried it because the URI class was not giving me the right answer in all cases, and to provide more information in this question. – NicoV
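One well-known case where the two differ is the plus sign: URLDecoder implements form decoding (application/x-www-form-urlencoded), so it turns "+" into a space, whereas "+" is an ordinary character in a URI path. A hedged sketch, with a made-up class name and the placeholder host example.org:

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

// Illustrative only: PlusSignDemo and example.org are assumptions.
final class PlusSignDemo {

    // Form-style decoding: '+' becomes a space.
    static String viaUrlDecoder(String rawPath) {
        try {
            return URLDecoder.decode(rawPath, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // URI path decoding: '+' is left alone.
    static String viaUri(String url) {
        return URI.create(url).getPath();
    }

    public static void main(String[] args) {
        System.out.println("[" + viaUrlDecoder("/wiki/C++") + "]");
        System.out.println("[" + viaUri("http://example.org/wiki/C++") + "]");
    }
}
```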

Anv%E4ndare

As PopoFibo says, this is not a valid UTF-8 encoded sequence.

You can do some tolerant best-guess decoding:

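import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Tries each charset in order and returns the first strict decode that succeeds.
public static String parse(String segment, Charset... encodings) {
    byte[] data = parse(segment);
    for (Charset encoding : encodings) {
        try {
            return encoding.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException notThisCharset_ignore) {}
    }
    return segment;
}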

private static byte[] parse(String segment) { 
    ByteArrayOutputStream buf = new ByteArrayOutputStream(); 
    Matcher matcher = Pattern.compile("%([A-Fa-f0-9][A-Fa-f0-9])") 
          .matcher(segment); 
    int last = 0; 
    while (matcher.find()) {
        appendAscii(buf, segment.substring(last, matcher.start()));
        byte hex = (byte) Integer.parseInt(matcher.group(1), 16);
        buf.write(hex);
        last = matcher.end();
    }
    appendAscii(buf, segment.substring(last)); 
    return buf.toByteArray(); 
} 

private static void appendAscii(ByteArrayOutputStream buf, String data) { 
    byte[] b = data.getBytes(StandardCharsets.US_ASCII); 
    buf.write(b, 0, b.length); 
}

This code successfully decodes the given strings:

for (String test : Arrays.asList("Fondation_Alliance_fran%C3%A7aise", 
    "Anv%E4ndare")) { 
    String result = parse(test, StandardCharsets.UTF_8, 
     StandardCharsets.ISO_8859_1); 
    System.out.println(result); 
}

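Note that this is not some foolproof system that lets you ignore correct URL encoding. It works here because v%E4n (the byte sequence 76 E4 6E) is not a valid sequence according to the UTF-8 scheme, and the decoder can detect this.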
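That detection can be checked in isolation. The sketch below feeds the three bytes to a strict UTF-8 decoder; the class and method names are hypothetical, while the decoder calls are the standard java.nio.charset API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustrative only: StrictUtf8Check and isValidUtf8 are hypothetical names.
final class StrictUtf8Check {

    // Returns true only if the whole array decodes as UTF-8 without errors.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 0xE4 announces a three-byte sequence, but 0x6E is not a continuation byte.
        byte[] vE4n = {0x76, (byte) 0xE4, 0x6E};
        // 0xC3 0xA7 is the two-byte UTF-8 encoding of U+00E7.
        byte[] cCedilla = {(byte) 0xC3, (byte) 0xA7};
        System.out.println(isValidUtf8(vE4n));
        System.out.println(isValidUtf8(cCedilla));
    }
}
```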

If you reverse the order of the encodings, the first string is happily (but incorrectly) decoded as ISO-8859-1.
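The reason is that every byte value is a valid ISO-8859-1 character, so a strict ISO-8859-1 decoder never reports an error and can only serve as the last fallback. A small sketch under that assumption (class and method names are made up for illustration):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustrative only: Latin1NeverFails and strictDecode are hypothetical names.
final class Latin1NeverFails {

    // Strict decode; returns null when the bytes are not valid in the charset.
    static String strictDecode(byte[] data, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // "fran%C3%A7aise" after percent-decoding, as raw bytes.
        byte[] data = {'f', 'r', 'a', 'n', (byte) 0xC3, (byte) 0xA7, 'a', 'i', 's', 'e'};
        System.out.println(strictDecode(data, StandardCharsets.UTF_8));
        // No error is raised here, even though the result is mojibake.
        System.out.println(strictDecode(data, StandardCharsets.ISO_8859_1));
    }
}
```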

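Note: HTTP doesn't care about percent-encoding, and you could write a web server that accepts http://foo/%%%%% as a valid form. The URI spec mandates UTF-8, but this was done retroactively. It is really up to the server to describe what form its URIs should take, and if you have to handle arbitrary URIs, you need to be aware of this legacy. I wrote more about URLs and Java here.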
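As a side note on the Java end of this: java.net.URI follows the URI grammar, so it refuses that example outright even though a server might accept it. A minimal sketch (the class and method names are assumptions for illustration):

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustrative only: UriStrictness and parses are hypothetical names.
final class UriStrictness {

    // True if java.net.URI accepts the string, false if it reports a syntax error.
    static boolean parses(String candidate) {
        try {
            new URI(candidate);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A '%' that is not followed by two hex digits is a malformed escape.
        System.out.println(parses("http://foo/%%%%%"));
        // "%25" is the escaped form of a literal percent sign.
        System.out.println(parses("http://foo/%25"));
    }
}
```

So code that must handle arbitrary legacy URLs may need to work on the raw string before, or instead of, handing it to java.net.URI.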

2014-02-20 12:15:11 McDowell

+1, very detailed and helpful – PopoFibo

@McDowell Thanks a lot, I will try your solution when I get home. – NicoV

Works great. I have edited my original question with the actual code I am using now. – NicoV
