Why do I get garbled text when fetching a page with DefaultHttpClient?

I am trying to fetch a page with Android's DefaultHttpClient and parse it with Jsoup. I get a very strange response in which all the HTML between the <body> and </body> tags comes back as unreadable bytes:

<html> 
    <head></head> 
    <body> 
     ��������������Y�#I�&amp;�\�+��*;����/U���53�*��U�=�D�I:I� ����X�����΃��=H��2�`Ѓ ��o��nͽ�C瘹;�l2Y�I_l�����;f��W�k��o2.����?�r&gt;��œ�qYξ&lt;&lt;&lt;;;�g*��ѡl���9&gt;[email protected]��`R��V �c�������Ɂ��e�����,&gt; }���A�����W�?��&quot;.��ˡhޖ�Qy1�oL�_�W�h?9�E?Ofe��KO�Q��(�Av�N�[email protected]��G�qvV�_G��W�g�'q�2�N��L�?�&quot;鳷�x�o�����$9�}/;'#ȸ Q��&amp;�2�\�a��aǔ�L�I�ԯ�=���TPFE� ���:�,�H�N�'QQԯ&lt;&gt;�i}�x��'$�'O ��[email protected]�h 2��ᓃ�CH��ʤO���0�LD)��p8�챺) 
    </body> 
</html>

Here is my method that fetches the page.

public String doGet(String strUrl, List<NameValuePair> lstParams) throws Exception { 

      String strResponse = null; 
      HttpGet htpGet = new HttpGet(strUrl); 
      //htpGet.addHeader("Accept-Encoding", "gzip, deflate"); 
      htpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1"); 
      DefaultHttpClient dhcClient = new DefaultHttpClient(); 
      PersistentCookieStore pscStore = new PersistentCookieStore(this.objContext); 
      dhcClient.setCookieStore(pscStore); 
      HttpResponse resResponse = dhcClient.execute(htpGet); 
      strResponse = EntityUtils.toString(resResponse.getEntity()); 
      return strResponse; 

    }

Why does this happen?

If I fetch the page with Jsoup itself, the response is fine. All I have to do is call Jsoup.connect("http://www.kat.ph/").get()


2012-09-09 Mridang Agarwalla

This happens because the response is gzip-compressed. I attached a custom response interceptor that decompresses the response. Here it is:

class Decompressor implements HttpResponseInterceptor { 

    /* 
    * @see org.apache.http.HttpResponseInterceptor#process(org.apache.http. 
    * HttpResponse, org.apache.http.protocol.HttpContext) 
    */ 
    public void process(HttpResponse hreResponse, HttpContext hctContext) throws HttpException, IOException { 

     HttpEntity entity = hreResponse.getEntity(); 

     if (entity != null) { 

      Header ceheader = entity.getContentEncoding(); 

      if (ceheader != null) { 

       HeaderElement[] codecs = ceheader.getElements(); 

       for (int i = 0; i < codecs.length; i++) { 

        if (codecs[i].getName().equalsIgnoreCase("gzip")) { 

         hreResponse.setEntity(new HttpEntityWrapper(entity) { 

          @Override 
          public InputStream getContent() throws IOException, IllegalStateException { 

           return new GZIPInputStream(wrappedEntity.getContent()); 

          } 

          @Override 
          public long getContentLength() { 

           return -1; 

          } 

         }); 

         return; 

        } 

       } 

      } 

     } 

    } 

}
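To make the cause concrete, here is a minimal, self-contained sketch using only the JDK. It is not part of the original answer: the class name GzipGarbageDemo and its helper methods are hypothetical, and UTF-8 is an assumed page charset. It compresses a small HTML string the way a server sending Content-Encoding: gzip would, then compares decoding the raw bytes as text (roughly what happens when the entity is read without decompression) against reading them through GZIPInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; not taken from the original answer.
final class GzipGarbageDemo {

    /** Compresses text the way a gzip-encoding server would (UTF-8 assumed). */
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    /** Inflates a gzip body and decodes the result as UTF-8 (assumed charset). */
    static String gunzip(byte[] body) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(body))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body>Hello</body></html>";
        byte[] body = gzip(html);

        // Decoding the compressed bytes directly as text yields garbage.
        String garbled = new String(body, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals(html));      // false

        // Inflating first recovers the original markup.
        System.out.println(gunzip(body).equals(html)); // true
    }
}
```

With the interceptor above, the remaining step would presumably be to register it on the client before executing the request, for example dhcClient.addResponseInterceptor(new Decompressor()), so that every gzip-encoded entity is wrapped before EntityUtils.toString reads it.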


2012-09-11 09:35:19

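Try it this way and see whether the result is the same:

URL url = new URL("Your_URL"); 

// Shorthand for url.openConnection().getInputStream() 
InputStream is = url.openStream(); 

// Scanner uses the platform default charset here 
Scanner scan = new Scanner(is); 

while (scan.hasNextLine()) { 

    System.out.println(scan.nextLine()); 

} 

// Closing the Scanner also closes the underlying stream 
scan.close();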
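Reading the stream as-is only helps if the body arrives uncompressed. On plain Java SE, URL.openStream() does not inflate a gzip body, so the same garbage can appear (Android's HttpURLConnection may negotiate and decode gzip on its own when no Accept-Encoding header is set explicitly). The following sketch, with a hypothetical class GzipAwareFetch and an assumed UTF-8 charset, checks the Content-Encoding header and wraps the stream only when the server reports gzip.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper class; names are illustrative only.
final class GzipAwareFetch {

    /** Wraps the raw stream in a GZIPInputStream only if the header says gzip. */
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding != null && contentEncoding.trim().equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }

    /** Reads the whole stream and decodes it with the given charset. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }

    /** Fetches a page, asking for gzip and inflating it when the server complies. */
    static String fetch(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setRequestProperty("Accept-Encoding", "gzip");
        try (InputStream in = decode(conn.getInputStream(), conn.getContentEncoding())) {
            // UTF-8 is an assumption; a real client should honor the Content-Type charset.
            return readAll(in, StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }
}
```

A call such as GzipAwareFetch.fetch("Your_URL") would then return readable markup whether or not the server compresses the response.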


2012-09-09 15:11:12
