How to decode Russian text

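I'm trying to load several websites with content in different languages. Only for the Russian content do I see <?> symbols instead of the text. Please help me decode it into the correct characters. My code sample: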

RequestTask t = new RequestTask(); 
response = t.doIt("http://google.ru"); //troubles 
//response = t.doIt("http://stackoverflow.com"); //ok 
//response = t.doIt("http://web.de/"); //ok 
//response = t.doIt("http://www.china.com/"); // omg, it's ok too! 

StatusLine statusLine = response.getStatusLine(); 

if (statusLine.getStatusCode() == HttpStatus.SC_OK) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    response.getEntity().writeTo(out);
    out.close();
    String response_string = new String(out.toByteArray(), "UTF-8");
}

The request code:

public class RequestTask {
    public HttpResponse doIt(String... uri)
            throws ConnectTimeoutException, UnknownHostException, IOException {
        HttpParams params = new BasicHttpParams();
        HttpConnectionParams.setConnectionTimeout(params, 6000);
        HttpConnectionParams.setSoTimeout(params, 6000);
        HttpClient httpclient = new DefaultHttpClient(params);
        HttpResponse response = null;
        Log.d(this.toString(), "HTTP GET to " + uri[0]);
        response = httpclient.execute(new HttpGet(uri[0]));
        Log.d(this.toString(), "response: " + response.getStatusLine().getReasonPhrase());

        return response;
    }
}

Source

2012-11-12 psct

I don't see any trouble with google.ru:

$ wget google.ru 
[...skipped....] 
$ enca -L ru index.html 
MS-Windows code page 1251 
    LF line terminators

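The enca output above reports Windows code page 1251 for this page, so decoding those bytes as UTF-8 is what produces the <?> characters. Always keep in mind that, besides "UTF-8", at least three other encodings are in more or less common use on web pages with Russian content. I would definitely check for "KOI-8R", "WIN-1251" and (not very popular) "Mac Cyrillic".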
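To make the mismatch concrete, here is a small Java sketch (not the asker's code; the class name MojibakeDemo and the sample word are made up for illustration). It encodes a Russian word the way a windows-1251 page would send it and then decodes the same bytes twice. It assumes the JVM ships the windows-1251 charset, which standard desktop JDKs do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    /** Decodes bytes with the given charset; malformed input becomes U+FFFD. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        // The Russian word for "hello", written with escapes so the source stays ASCII.
        String original = "\u041f\u0440\u0438\u0432\u0435\u0442";
        byte[] raw = original.getBytes(cp1251);

        // Wrong charset: the bytes are not valid UTF-8, so replacement characters appear.
        System.out.println(decode(raw, StandardCharsets.UTF_8));
        // Right charset: the original text comes back.
        System.out.println(decode(raw, cp1251));
    }
}
```

The first line of output is the same kind of garbage as in the question; the second is readable Russian.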

You would probably be better off using something like this:

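# Python 3. Assumes `data` holds the raw page bytes and `name` labels the page.
encodings = ("windows-1251", "koi8-r")  # maybe some others...

result = ""
detected = None
for enc in encodings:
    try:
        # A successful decode is only a hint: single-byte code pages
        # accept almost any byte sequence.
        result = data.decode(enc)
        detected = enc
        break
    except UnicodeDecodeError:
        continue

if detected:
    print(name + "\t: " + detected)
else:
    print(name + "\t: unable to determine the encoding")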
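Since the question is about Java, here is a rough Java counterpart of the loop above. It is only a sketch: the class CharsetGuesser and its method names are hypothetical, and the candidate list is an assumption. By default `new String(bytes, charset)` silently substitutes replacement characters, so the sketch uses a `CharsetDecoder` configured with `CodingErrorAction.REPORT`, which throws `CharacterCodingException` on bad input. UTF-8 goes first because it is strict; single-byte code pages accept almost anything, so a match there is a guess, not a proof.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

class CharsetGuesser {
    /** Strictly decodes the bytes; throws instead of substituting U+FFFD. */
    static String decodeStrict(byte[] data, String charsetName) throws CharacterCodingException {
        return Charset.forName(charsetName)
                .newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data))
                .toString();
    }

    /** Returns the first candidate that decodes cleanly, or null if none does. */
    static String guess(byte[] data, String... candidates) {
        for (String name : candidates) {
            try {
                decodeStrict(data, name);
                return name;
            } catch (CharacterCodingException e) {
                // Not valid in this charset; try the next candidate.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] cp1251Bytes = "\u041f\u0440\u0438\u0432\u0435\u0442"
                .getBytes(Charset.forName("windows-1251"));
        // prints windows-1251, because the bytes are not valid UTF-8
        System.out.println(guess(cp1251Bytes, "UTF-8", "windows-1251", "KOI8-R"));
    }
}
```

Once a charset name is chosen, `decodeStrict(bytes, name)` or `new String(bytes, name)` gives the text.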

Source

2012-11-12 23:18:59 lenik

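So what is good practice here? How can I determine the page's encoding? Right now I know two working approaches: 'String response_string = new String(out.toByteArray(), "windows-1251");' and 'String response_string = EntityUtils.toString(response.getEntity(), "UTF-8");'. To use the first approach I need to determine the encoding of the response, and I can't find a suitable function for that. The second approach works, but is it correct to use it for any encoding? – psct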

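Good practice is to assume that you don't know the page encoding and that there is no reliable way to find it out. Sometimes the encoding is specified in the 'meta' tag, sometimes it isn't. See the answer update for more code. – lenik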

But is there a function in Java that, like 'unicode', throws an exception on an encoding error? The only thing I found is 'Auto-detect encoding on URL': http://illegalargumentexception.blogspot.ru/2009/05/java-rough-guide-to-character-encoding.html – psct
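On the 'meta' tag point from the comments: when the server or the page does declare an encoding, it is usually worth reading that declaration before guessing. The sketch below is an illustration only; the class DeclaredCharset, its methods and the 1024-byte limit are assumptions, not part of any library. It looks for "charset=..." first in a Content-Type header value and then in the beginning of the body. With the asker's HttpClient code the header would come from 'response.getEntity().getContentType()', which may be null. As far as I know, 'EntityUtils.toString(entity, "UTF-8")' already honours the header charset and uses the second argument only as a fallback, which would explain why it works in the question; it does not look at the 'meta' tag.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DeclaredCharset {
    // Matches charset=... in a Content-Type header value or inside a <meta> tag.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset name, or null when the text declares none. */
    static String find(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    /** Prefers the HTTP header, then the start of the body, then the fallback. */
    static String choose(String contentTypeHeader, byte[] body, String fallback) {
        String fromHeader = find(contentTypeHeader);
        if (fromHeader != null) {
            return fromHeader;
        }
        // ISO-8859-1 maps every byte to one char, so the ASCII markup survives intact.
        int n = Math.min(body.length, 1024);
        String head = new String(body, 0, n, StandardCharsets.ISO_8859_1);
        String fromMeta = find(head);
        return fromMeta != null ? fromMeta : fallback;
    }

    public static void main(String[] args) {
        byte[] body = "<html><head><meta charset=\"koi8-r\"></head></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(choose("text/html; charset=windows-1251", body, "UTF-8"));
        System.out.println(choose("text/html", body, "UTF-8"));
    }
}
```

The declared name can still be wrong or unsupported, so it is safer to check it with 'Charset.isSupported(name)' and keep a fallback.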
