How do I convert a string of Russian Cyrillic letters?

String artist - I don't know what encoding it is in.

Ïåñíÿ ïðî íàäåæäó - in Russian this should be "Песня про надежду"

To detect the encoding of the string I use http://code.google.com/p/juniversalchardet/

Code:

String GetEncoding(String text) throws IOException {
    byte[] buf = new byte[4096];

    InputStream fis = new ByteArrayInputStream(text.getBytes());

    UniversalDetector detector = new UniversalDetector(null);

    int nread;
    while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread);
    }
    detector.dataEnd();
    String encoding = detector.getDetectedCharset();
    detector.reset();
    return encoding;
}

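and convert:

new String(text.getBytes(encoding), "cp1251"); - but this does not work.

If I use UTF-16:

new String(text.getBytes("UTF-16"), "cp1251") returns "юяПесняпронадежду", where the gaps that show up between the letters are not space characters.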

Edit:

This is where the bytes are first read:

byte[] abyFrameData = new byte[iTagSize]; 
oID3DIS.readFully(abyFrameData); 
ByteArrayInputStream oFrameBAIS = new ByteArrayInputStream(abyFrameData);

String s = new String(abyFrameData, "????");

2011-05-16 Mediator

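How did you get the String text parameter? The problem may lie in how you create the input for the detector. Java strings are always UTF-16, so some character conversion has already happened by this point. – stevevls 2011-05-16 12:06:37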

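'new String(text.getBytes("UTF-16"), "cp1251")' doesn't do what you think it does. What it actually does is take an existing string, retrieve its bytes as UTF-16, and then try to create a new string on the assumption that those bytes are CP1251. This is guaranteed to be wrong. – Anon 2011-05-16 12:12:39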

@stevevls, hmm, a Java String is always UTF-16, not Unicode http://download.oracle.com/javase/tutorial/i18n/text/index.html – mKorbel 2011-05-16 12:15:16

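Java strings are UTF-16. Data in any other encoding has to be represented as a byte sequence. To decode character data, you must supply the encoding at the moment you first create the string. If you already have a corrupted string, it is too late.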
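To make this concrete, here is a minimal sketch of the difference between decoding at creation time and re-encoding afterwards. The class and method names are made up for illustration, and it assumes the JDK ships the windows-1251 charset (common, but not guaranteed by the Java specification). The raw bytes are the cp1251 form of "Песня", the first word of the title in the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeAtCreation {
    // "Песня" as stored by a windows-1251 (cp1251) writer.
    static final byte[] RAW = {
        (byte) 0xCF, (byte) 0xE5, (byte) 0xF1, (byte) 0xED, (byte) 0xFF
    };

    // Decode raw bytes with an explicitly named charset.
    // Assumption: the named charset is available in this JDK.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    // The round trip from the question: encode an existing String as
    // UTF-16, then pretend those bytes are cp1251.
    static String utf16ThenCp1251(String text) {
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16);
        return decode(utf16, "windows-1251");
    }

    public static void main(String[] args) {
        String right = decode(RAW, "windows-1251"); // the Cyrillic text
        String garbled = decode(RAW, "ISO-8859-1"); // Latin-1 look-alike
        String worse = utf16ThenCp1251(garbled);
        System.out.println(right);
        System.out.println(garbled);
        // The BOM bytes FE FF decode to two Cyrillic letters, and every
        // zero high byte becomes a U+0000 character, not a space.
        System.out.println(worse.length() + " chars, char 2 is U+"
                + String.format("%04X", (int) worse.charAt(2)));
    }
}
```

The last call reproduces the symptom from the question: the result starts with "юя" (the UTF-16 byte order mark read as cp1251), and a U+0000 character sits in front of each letter.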

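Assuming this is ID3, the specification defines the encoding rules. For example, ID3v2.4.0 may restrict the encodings used via the extended header: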

q - Text encoding restrictions

    0    No restrictions
    1    Strings are only encoded with ISO-8859-1 [ISO-8859-1] or
         UTF-8 [UTF-8].

Encoding handling is defined further down the document:

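If nothing else is said, strings, including numeric strings and URLs, are represented as ISO-8859-1 characters in the range $20 - $FF. Such strings are represented in frame descriptions as <text string>, or <full text string> if newlines are allowed. If nothing else is said, the newline character is forbidden. In ISO-8859-1 a newline is represented, when allowed, with $0A only.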

Frames that allow different types of text encoding contain a text encoding description byte. Possible encodings:
$00   ISO-8859-1 [ISO-8859-1]. Terminated with $00.
$01   UTF-16 [UTF-16] encoded Unicode [UNICODE] with BOM. All
      strings in the same frame SHALL have the same byteorder.
      Terminated with $00 00.
$02   UTF-16BE [UTF-16] encoded Unicode [UNICODE] without BOM.
      Terminated with $00 00.
$03   UTF-8 [UTF-8] encoded Unicode [UNICODE]. Terminated with
      $00.

Use a transcoding class such as InputStreamReader or (more likely in this case) the String(byte[],Charset) constructor to decode the data. See also Java: a rough guide to character encoding.
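As a rough illustration of those two options, the sketch below decodes the same bytes once through an InputStreamReader and once through the String(byte[],Charset) constructor. The class and method names are invented for this example, and the input is a hand-written UTF-8 sequence rather than real tag data.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TranscodeDemo {
    // Stream-style decoding: wrap the byte source in a charset-aware Reader.
    static String readAll(byte[] data, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader =
                new InputStreamReader(new ByteArrayInputStream(data), charset)) {
            char[] chunk = new char[1024];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                text.append(chunk, 0, n);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    // Buffer-style decoding: the whole byte array is already in memory.
    static String decode(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    public static void main(String[] args) {
        // UTF-8 bytes for the two Cyrillic letters U+041F and U+0435.
        byte[] utf8 = { (byte) 0xD0, (byte) 0x9F, (byte) 0xD0, (byte) 0xB5 };
        String viaReader = readAll(utf8, StandardCharsets.UTF_8);
        String viaConstructor = decode(utf8, StandardCharsets.UTF_8);
        System.out.println(viaReader.equals(viaConstructor));
    }
}
```

For a tag frame that has already been read into a byte array, as in the question, the constructor form is the shorter path; the Reader form fits when the data arrives as a stream.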

Parsing the string components of an ID3v2.4.0 data structure would look something like this:

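// untested code
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.Arrays;

public String parseID3String(DataInputStream in) throws IOException {
    // Index is the text encoding description byte ($00..$03).
    String[] encodings = { "ISO-8859-1", "UTF-16", "UTF-16BE", "UTF-8" };
    // readUnsignedByte() throws EOFException at end of stream instead of returning -1.
    String encoding = encodings[in.readUnsignedByte()];
    // UTF-16 variants are terminated with $00 00, the others with $00.
    byte[] terminator =
        encoding.startsWith("UTF-16") ? new byte[2] : new byte[1];
    byte[] buf = terminator.clone();
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    while (true) {
        in.readFully(buf);
        if (Arrays.equals(terminator, buf)) {
            break; // keep the terminator out of the decoded text
        }
        buffer.write(buf);
    }
    return new String(buffer.toByteArray(), encoding);
}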
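One way to try a parser of this shape without a real MP3 file is to feed it a hand-built byte sequence: encoding byte, text, terminator. The sketch below is only that, a test harness under assumptions. `Id3StringDemo`, `sample`, and `parseBytes` are hypothetical names, and the class carries its own compact copy of the parsing loop (one that skips the terminator) so that it runs standalone.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Id3StringDemo {
    // Compact stand-in for the parser: encoding byte, text, terminator.
    static String parse(DataInputStream in) throws IOException {
        String[] encodings = { "ISO-8859-1", "UTF-16", "UTF-16BE", "UTF-8" };
        String encoding = encodings[in.readUnsignedByte()];
        byte[] terminator =
            encoding.startsWith("UTF-16") ? new byte[2] : new byte[1];
        byte[] unit = terminator.clone();
        ByteArrayOutputStream text = new ByteArrayOutputStream();
        for (in.readFully(unit); !Arrays.equals(unit, terminator); in.readFully(unit)) {
            text.write(unit);
        }
        return new String(text.toByteArray(), encoding);
    }

    // Hypothetical helper: wrap a byte array and hide the checked exception.
    static String parseBytes(byte[] frame) {
        try {
            return parse(new DataInputStream(new ByteArrayInputStream(frame)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: build "encoding byte + text + zero terminator".
    static byte[] sample(int encodingByte, byte[] text, int terminatorLength) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(encodingByte);
        out.write(text, 0, text.length);
        out.write(new byte[terminatorLength], 0, terminatorLength);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // "Песня", written with escapes so the source file encoding does not matter.
        String title = "\u041f\u0435\u0441\u043d\u044f";
        byte[] utf8 = sample(0x03, title.getBytes(StandardCharsets.UTF_8), 1);
        byte[] utf16 = sample(0x01, title.getBytes(StandardCharsets.UTF_16), 2);
        System.out.println(parseBytes(utf8).equals(title));
        System.out.println(parseBytes(utf16).equals(title));
    }
}
```

Note that this only exercises terminated strings. Whether every string in a real frame carries a terminator is something to check against the specification for the frame type in question.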

2011-05-16 13:03:50 McDowell

I read this... but I don't understand it. I have edited my post. – Mediator 2011-05-16 15:14:03

This works for me:

byte[] bytes = s.getBytes("ISO-8859-1"); 
UniversalDetector encDetector = new UniversalDetector(null); 
encDetector.handleData(bytes, 0, bytes.length); 
encDetector.dataEnd(); 
String encoding = encDetector.getDetectedCharset(); 
if (encoding != null) s = new String(bytes, encoding);
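A likely reason this works: if the tag text was decoded as ISO-8859-1 in the first place, `getBytes("ISO-8859-1")` returns exactly the original bytes, because that charset maps each byte value 0x00 to 0xFF to the code point with the same number. The sketch below shows the same round trip with the standard library only. The charset name is passed in by hand instead of being detected, and windows-1251 is assumed to be available in the JDK; `Latin1Rescue` and `redecode` are illustrative names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Latin1Rescue {
    // Undo a wrong ISO-8859-1 decode: recover the original bytes, then
    // decode them with the charset they were actually written in.
    static String redecode(String garbled, String actualCharset) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, Charset.forName(actualCharset));
    }

    public static void main(String[] args) {
        // "Ïåñíÿ", written with escapes so the source file encoding does not matter.
        String garbled = "\u00cf\u00e5\u00f1\u00ed\u00ff";
        String fixed = redecode(garbled, "windows-1251");
        // Compare against "Песня" given as escapes for the same reason.
        System.out.println(fixed.equals("\u041f\u0435\u0441\u043d\u044f"));
    }
}
```

This rescue depends on the first, wrong decode having been ISO-8859-1. A string that went through a lossy charset, where some bytes were replaced during decoding, cannot be repaired this way.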

2014-05-07 06:11:55 Nik
