2011-03-23 42 views
2

How to correctly decode this string in Java

http%3A//www.google.ru/search%3Fhl%3Dru%26q%3Dla+mer+powder%26btnG%3D%u0420%A0%u0421%u045F%u0420%A0%u0421%u2022%u0420%A0%u0421%u2018%u0420%u040E%u0420%u0453%u0420%A0%u0421%u201D+%u0420%A0%u0420%u2020+Google%26lr%3D%26rlz%3D1I7SKPT_ru 

I need to decode the string above. When I use URLDecoder.decode(), I get the following error:

java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - For input string: "u0" 

Thanks, Dave

+1

The URL is not properly encoded to begin with. – 2011-03-23 16:32:32

+0

@Johan If it is part of a larger URL (such as http://foo.com/?url=<the string above>), it could be, but otherwise, agreed. – 2011-03-23 16:35:17

+0

@Johan, why not? @Daniel, exactly my thought: http://www.google.com/search?q=http%3A//www.google.ru/search%3Fhl%3Dru%26q%3Dla+mer+powder%26btnG%3D%u0420%A0%u0421%u045F%u0420%A0%u0421%u2022%u0420%A0%u0421%u2018%u0420%u040E%u0420%u0453%u0420%A0%u0421%u201D+%u0420%A0%u0420%u2020+Google%26lr%3D%26rlz%3D1I7SKPT_ru – OscarRyz 2011-03-23 16:35:35

Answers

2

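According to Wikipedia, "there exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a Unicode value". It continues: "This behavior is not specified by any RFC and has been rejected by the W3C".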

Your URL contains these tokens, and the Java URLDecoder implementation does not support them.
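
To make the failure concrete, here is a minimal sketch that compares a standard escape with a %uXXXX escape. The class and method names (PercentUFailureDemo, tryDecode) are made up for illustration; only URLDecoder.decode comes from the JDK.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class PercentUFailureDemo {
    /** Returns the decoded text, or null if URLDecoder rejects the input. */
    static String tryDecode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (IllegalArgumentException | UnsupportedEncodingException e) {
            // IllegalArgumentException is what URLDecoder throws when the two
            // characters after '%' are not hex digits, as in "%u0420".
            return null;
        }
    }
}
```

With this helper, tryDecode("%D0%A0") returns the single character U+0420, while tryDecode("%u0420") returns null because the decoder tries to read "u0" as a hex byte.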

2

The %uXXXX encoding is non-standard and was in fact rejected by the W3C, so it is only natural that URLDecoder does not understand it.

You could write a small function that fixes it by replacing %uXXYY with %XX%YY in your encoded string. You can then process and decode the fixed string as usual.
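
A minimal sketch of such a fix-up function follows. It is a variation on the idea above rather than the literal %XX%YY rewrite: each run of %uXXXX escapes is re-encoded as standard percent-encoded UTF-8, so the ordinary %XX escapes in the string can stay untouched and the whole string can be decoded as UTF-8. The names PercentUDecoder and decode are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PercentUDecoder {
    // One or more consecutive %uXXXX escapes. Runs are kept together so that
    // a surrogate pair such as %uD83D%uDE00 is converted as one character.
    private static final Pattern PERCENT_U_RUN =
            Pattern.compile("(?:%u[0-9A-Fa-f]{4})+");

    static String decode(String input) {
        try {
            Matcher m = PERCENT_U_RUN.matcher(input);
            StringBuffer rewritten = new StringBuffer();
            while (m.find()) {
                String run = m.group();
                StringBuilder chars = new StringBuilder();
                // Each escape is 6 characters long: "%u" plus 4 hex digits.
                for (int i = 0; i < run.length(); i += 6) {
                    chars.append((char) Integer.parseInt(run.substring(i + 2, i + 6), 16));
                }
                // Re-encode the characters as standard percent-encoded UTF-8.
                String standard = URLEncoder.encode(chars.toString(), "UTF-8");
                m.appendReplacement(rewritten, Matcher.quoteReplacement(standard));
            }
            m.appendTail(rewritten);
            // Everything is now standard, so the stock decoder can handle it.
            return URLDecoder.decode(rewritten.toString(), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

For example, PercentUDecoder.decode("x%3D%u0420") yields "x=" followed by U+0420. One caveat, stated as an assumption about the input rather than a guarantee: lone bytes such as the %A0 in the question's string are not valid UTF-8 on their own, so they would most likely come out as replacement characters.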

1

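We started with Vartec's solution but ran into additional problems. This solution works for UTF-16, but it can be changed to return UTF-8. The replacements are written out explicitly for clarity; you can read more at http://www.cogniteam.com/wiki/index.php?title=DecodeEncodeJavaScript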

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

static public String unescape(String escaped) throws UnsupportedEncodingException
{
    // This code is needed so that the UTF-16 won't be malformed:
    // widen every standard %XX escape to %u00XX so that all escapes are 16-bit units.
    // Upper- and lower-case hex digits are both accepted; %uXXXX is left alone
    // because 'u' is not a hex digit.
    String str = escaped.replaceAll("%([0-9A-Fa-f]{2})", "%u00$1");

    // Here we split each 4-digit %uXXYY escape into two bytes, %XX%YY,
    // so that decode won't fail. Incomplete escapes are left as they are.
    str = str.replaceAll("%u([0-9A-Fa-f]{2})([0-9A-Fa-f]{2})", "%$1%$2");

    // Here we return the decoded string
    return URLDecoder.decode(str, "UTF-16");
}
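
The reason every escape has to be widened to two bytes can be seen by decoding raw bytes directly. The sketch below is illustrative only (Utf16PairDemo and fromBytes are made-up names). It uses UTF-16BE explicitly, on the assumption that this matches what Java's "UTF-16" decoder does when the input carries no byte order mark.

```java
import java.nio.charset.StandardCharsets;

final class Utf16PairDemo {
    /** Decodes big-endian 16-bit units, the byte order that %uXXYY -> %XX%YY produces. */
    static String fromBytes(int... bytes) {
        byte[] raw = new byte[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            raw[i] = (byte) bytes[i];
        }
        return new String(raw, StandardCharsets.UTF_16BE);
    }
}
```

Here fromBytes(0x04, 0x20) gives U+0420 (the bytes behind %u0420), and fromBytes(0x00, 0x3A) gives ":" (the bytes behind %00%3A). A lone 0x3A is only half of a 16-bit unit and does not decode to ":", which is why %3A is rewritten before decoding.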
1

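After taking a good look at the solution presented by @ariy, I created a Java-based solution that is also resilient against encoded characters that have been chopped into two parts (i.e. half of the encoded character is missing). This happens in my use case, where I need to decode long URLs that are sometimes chopped at a length of 2000 characters. See What is the maximum length of a URL in different browsers?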

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Pattern;

public class Utils {

    private static final Pattern validStandard      = Pattern.compile("%([0-9A-Fa-f]{2})");
    private static final Pattern choppedStandard    = Pattern.compile("%[0-9A-Fa-f]{0,1}$");
    private static final Pattern validNonStandard   = Pattern.compile("%u([0-9A-Fa-f]{2})([0-9A-Fa-f]{2})");
    private static final Pattern choppedNonStandard = Pattern.compile("%u[0-9A-Fa-f]{0,3}$");

    public static String resilientUrlDecode(String input) {
        String cookedInput = input;

        if (cookedInput.indexOf('%') > -1) {
            // Widen every standard single-byte escape (%XX) to a 16-bit unit (%00%XX).
            cookedInput = validStandard.matcher(cookedInput).replaceAll("%00%$1");

            // Discard chopped encoded char at the end of the line (there is no way to know what it was)
            cookedInput = choppedStandard.matcher(cookedInput).replaceAll("");

            // Handle non standard (rejected by W3C) encoding that is used anyway by some
            // See: https://stackoverflow.com/a/5408655/114196
            if (cookedInput.contains("%u")) {
                // Split every non-standard escape (%uXXYY) into two bytes (%XX%YY).
                cookedInput = validNonStandard.matcher(cookedInput).replaceAll("%$1%$2");

                // Discard chopped encoded char at the end of the line
                cookedInput = choppedNonStandard.matcher(cookedInput).replaceAll("");
            }
        }

        try {
            return URLDecoder.decode(cookedInput, "UTF-16");
        } catch (UnsupportedEncodingException e) {
            // Will never happen because the encoding is hardcoded
            return null;
        }
    }
}
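
To see what the two "chopped" patterns do in isolation, here is a small sketch that applies only the trimming step. The regular expressions are copied from the answer above; the class and method names (ChoppedEscapeDemo, dropChoppedTail) are hypothetical.

```java
import java.util.regex.Pattern;

final class ChoppedEscapeDemo {
    // Same expressions as in the answer: an incomplete escape at the very end of the input.
    private static final Pattern CHOPPED_STANDARD =
            Pattern.compile("%[0-9A-Fa-f]{0,1}$");
    private static final Pattern CHOPPED_NON_STANDARD =
            Pattern.compile("%u[0-9A-Fa-f]{0,3}$");

    /** Removes a truncated %X or %uXXX escape from the end of the input, if there is one. */
    static String dropChoppedTail(String input) {
        String trimmed = CHOPPED_STANDARD.matcher(input).replaceAll("");
        return CHOPPED_NON_STANDARD.matcher(trimmed).replaceAll("");
    }
}
```

A truncated tail such as "abc%3", "abc%" or "abc%u04" is reduced to "abc", while complete escapes such as "abc%3A" and "abc%u0420" pass through unchanged, so only the unrecoverable fragment is lost.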