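Detecting the encoding of a URL in Java

I have a situation with mixed data in a database, and I am trying to find out whether this is a solvable problem. What I have are partial URLs in one of three formats: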

/some/path?ugly=häßlich // case 1, Encoding: UTF-8 (plain) 
/some/path?ugly=h%C3%A4%C3%9Flich // case 2, Encoding: UTF-8 (URL-encoded) 
/some/path?ugly=h%E4%DFlich // case 3: Encoding: ISO-8859-1 (URL-encoded)

What I need in my application is the URL-encoded UTF-8 version:

/some/path?ugly=h%C3%A4%C3%9Flich // Encoding: UTF-8 (URL-encoded)

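The strings in the DB are all UTF-8, but the URL encoding may or may not be present, and may be in either format.

I have a method a that encodes plain UTF-8 to URL-encoded UTF-8, and a method b that decodes URL-encoded ISO-8859-1 to plain UTF-8, so basically what I plan to do is:

Case 1: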

String output = a(input);

Case 2:

String output = input;

Case 3:

String output = a(b(input));
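
The question does not show a and b. Purely as an illustration, they might look like the following with JDK classes only. The class name and both method bodies are assumptions, not the asker's code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;

final class UrlCaseSketch {

    /** Hypothetical a: percent-encodes non-ASCII characters as UTF-8, leaving ASCII as it is. */
    static String a(String plain) {
        try {
            // toASCIIString() escapes characters outside US-ASCII using their UTF-8 bytes
            return new URI(plain).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URI reference: " + plain, e);
        }
    }

    /** Hypothetical b: decodes %XX escapes, interpreting the bytes as ISO-8859-1. */
    static String b(String isoEncoded) {
        try {
            return URLDecoder.decode(isoEncoded, "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            // ISO-8859-1 is a charset every Java platform must support
            throw new IllegalStateException(e);
        }
    }
}
```

Note that URLDecoder also turns '+' into a space and decodes escaped reserved characters such as %26, so this sketch only suits simple alphanumeric parameters like the ones in the examples.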

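All of these cases work fine if I know which is which, but is there a safe way to detect whether such a string is case 2 or case 3? (I can limit the languages used in the parameters to European ones: German, English, French, Dutch, Polish, Russian, Danish, Norwegian, Swedish and Turkish, if that is any help.)

I know the obvious solution would be to clean up the data, but unfortunately the data is not created by me, nor by people with the necessary technical understanding (and there is a lot of legacy data that needs to keep working).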


2012-07-10 Sean Patrick Floyd

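Are only letters (as in your example) and digits encoded? – s106mo 2012-07-10 20:24:08

@s106mo Yes, the application is a redirect to a better search query, and those are alphanumeric by definition. Thanks for the suggestion – 2012-07-10 21:21:42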

If you can assume that only letters and digits are encoded, the following would work for:

"häßlich"
"h%C3%A4%C3%9Flich"
"h%E4%DFlich"

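import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Check in this order: a plain string also passes the two checks below it.

/** Case 1: plain text, nothing is URL-encoded. */
public static boolean isUtf8Encoded(String url) {
    return isAlphaNumeric(url);
}

/** Case 2: URL-encoded UTF-8. Invalid byte sequences decode to U+FFFD, which is not a letter. */
public static boolean isUrlUtf8Encoded(String url)
        throws UnsupportedEncodingException {
    return isAlphaNumeric(URLDecoder.decode(url, "UTF-8"));
}

/** Case 3: URL-encoded ISO-8859-1. */
public static boolean isUrlIsoEncoded(String url)
        throws UnsupportedEncodingException {
    return isAlphaNumeric(URLDecoder.decode(url, "ISO-8859-1"));
}

/** Returns true if every character is a letter or a digit. */
private static boolean isAlphaNumeric(String decode) {
    for (char c : decode.toCharArray()) {
        if (!Character.isLetterOrDigit(c)) {
            return false;
        }
    }
    return true;
}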
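
The three checks could be combined, in that order, into a single normalizing method. The following is a minimal sketch that assumes only the parameter value (for example the part after `ugly=`) is passed in, not the whole URL. The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class UrlValueNormalizer {

    /** Returns the value as URL-encoded UTF-8, whichever of the three cases it arrives in. */
    static String normalize(String value) {
        if (isAlphaNumeric(value)) {
            return encodeUtf8(value);               // case 1: plain text
        }
        if (isAlphaNumeric(decode(value, "UTF-8"))) {
            return value;                           // case 2: already URL-encoded UTF-8
        }
        String iso = decode(value, "ISO-8859-1");
        if (isAlphaNumeric(iso)) {
            return encodeUtf8(iso);                 // case 3: URL-encoded ISO-8859-1
        }
        throw new IllegalArgumentException("Unrecognized encoding: " + value);
    }

    private static boolean isAlphaNumeric(String s) {
        for (char c : s.toCharArray()) {
            if (!Character.isLetterOrDigit(c)) {
                return false;
            }
        }
        return true;
    }

    private static String decode(String s, String charset) {
        try {
            return URLDecoder.decode(s, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);     // both charsets are mandatory in Java
        }
    }

    private static String encodeUtf8(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With the example values from the question, all three inputs should come out as `h%C3%A4%C3%9Flich`.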


2012-07-10 20:33:34 s106mo

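Thanks for the accepted answer, but it does not work for URLs, because URLs also contain reserved characters such as '/', '?' and '='. This is my solution:

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.Arrays;
import java.util.List;

/**
 * List of valid characters in URL.
 */
private static final List<Character> VALID_CHARACTERS = Arrays.asList(
    '-', '.', '_', '~', ':', '/', '?', '#', '[', ']', '@', '!',
    '$', '&', '\'', '(', ')', '*', '+', ',', ';', '='
);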

/**
 * Check whether decoding was successful.
 * Despite its name, this returns true when the string is well-formed.
 * @param url URL to check
 * @return True if every character is a letter, a digit or a valid URL character.
 */
private static boolean isMalformed(final String url) {
    for (char c : url.toCharArray()) {
        if (!VALID_CHARACTERS.contains(c) && !Character.isLetterOrDigit(c)) {
            return false;
        }
    }
    return true;
}

/**
 * Try to decode URL with specific encoding.
 * @param url URL
 * @param encoding Valid encoding
 * @return Decoded URL, or null if the encoding is not the right one
 * @throws IllegalArgumentException If the encoding is not supported on your system.
 */
private static String _decodeUrl(final String url, final String encoding) {
    try {
        final String decoded = URLDecoder.decode(url, encoding);
        // isMalformed() returns true when the decoded string looks valid
        if (isMalformed(decoded)) {
            return decoded;
        }
    } catch (UnsupportedEncodingException ex) {
        throw new IllegalArgumentException("Illegal encoding: " + encoding);
    }
    return null;
}

/**
 * Decode URL with the most popular encodings for URLs.
 * @param url URL
 * @return Decoded URL, or the original one if none of the encodings gives a valid result.
 */
public static String decodeUrl(final String url) {
    final String[] mostPopularEncodings = new String[] {"iso-8859-1", "utf-8", "GB2312"};
    return decodeUrl(url, mostPopularEncodings);
}

/**
 * Decode URL by trying each of the given encodings in turn.
 * @param url URL
 * @param encodings Encodings to try, in order; the first valid result wins
 * @return Decoded URL, or the original one if none of the encodings gives a valid result.
 */
public static String decodeUrl(final String url, final String... encodings) {
    for (String e : encodings) {
        final String decoded = _decodeUrl(url, e);
        if (decoded != null) {
            return decoded;
        }
    }
    return url;
}
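
A variation on the same idea, not taken from the answer above: instead of checking the decoded characters against a whitelist, the percent escapes can be turned into raw bytes and handed to a strict UTF-8 decoder that reports malformed input. This is a minimal sketch, and the class and method names are hypothetical. Some ISO-8859-1 byte sequences are also valid UTF-8, so the check is a heuristic rather than a guarantee.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8Check {

    /** Returns true if the URL, once percent-decoded to bytes, is well-formed UTF-8. */
    static boolean looksLikeUtf8(String url) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(percentDecodeToBytes(url)));
            return true;
        } catch (CharacterCodingException e) {
            return false;   // e.g. %E4%DF is not a valid UTF-8 sequence
        }
    }

    /** Converts %XX escapes to single bytes; every other character is written as UTF-8. */
    static byte[] percentDecodeToBytes(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            if (s.charAt(i) == '%' && i + 2 < s.length()
                    && Character.digit(s.charAt(i + 1), 16) >= 0
                    && Character.digit(s.charAt(i + 2), 16) >= 0) {
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                out.write((hi << 4) | lo);
                i += 3;
            } else {
                int cp = s.codePointAt(i);
                byte[] b = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                out.write(b, 0, b.length);
                i += Character.charCount(cp);
            }
        }
        return out.toByteArray();
    }
}
```

A string for which `looksLikeUtf8` returns false could then be treated as case 3 and re-encoded. Pure ASCII input passes the check and needs no conversion either way.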


2014-06-24 05:09:18 user1079877

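Nice, but instead of the Character object, a [Guava `CharMatcher`](http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/base/CharMatcher.html) would be more efficient – 2014-06-24 07:25:49

Thanks, but I think it also uses isLetterOrDigit internally! And what if I am not using the Google library? – user1079877 2014-06-25 09:29:04

No, it doesn't. It is optimized to use a bit table for lookups. As for not using Google libraries: maybe you should reconsider. They are among the best open source libraries out there – 2014-06-25 09:57:44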
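
The bit-table idea from the comments can be illustrated without Guava using `java.util.BitSet`. This is only a sketch of the lookup technique and makes no claim about how `CharMatcher` is implemented. The class and method names are hypothetical, and the character set is the one from the answer above.

```java
import java.util.BitSet;

final class UrlCharTable {

    // One bit per ASCII code point; set bits mark the allowed punctuation characters.
    private static final BitSet ALLOWED = new BitSet(128);

    static {
        for (char c : "-._~:/?#[]@!$&'()*+,;=".toCharArray()) {
            ALLOWED.set(c);
        }
    }

    /** Constant-time table lookup, falling back to the letter-or-digit test. */
    static boolean isAllowed(char c) {
        return ALLOWED.get(c) || Character.isLetterOrDigit(c);
    }

    /** Returns true if every character of the string is allowed. */
    static boolean allAllowed(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!isAllowed(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

Compared with `List.indexOf`, the lookup avoids boxing each `char` and scanning the list for every character.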

You can decode first and then encode; if you have an unencoded URL, it will not be affected by the decoding:

// Assumption: URIUtil is org.apache.commons.httpclient.util.URIUtil (Commons HttpClient 3.x)
String url = "your url";
url = URIUtil.decode(url, "UTF-8");
url = URIUtil.encodeQuery(url, "UTF-8");


2016-10-12 12:27:46 Elsayed

I think you mean [Apache HttpComponents' `URIUtil`](https://hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/org/apache/http/client/utils/URIUtils.html) – 2016-10-12 13:16:34
