Reading strings with RandomAccessFile from a file in a different encoding

I have a large file encoded in 1250. Each line is just a single Polish word, one after another:

zając 
dzieło 
kiepsko 
etc

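I need to pick 10 unique random lines from this file reasonably quickly. I did that, but when I print the words they come out mis-encoded [zaj?c, dzie?o, kiepsko ...], and I need UTF8. So I changed my code to read bytes from the file instead of just reading lines, and ended up with this code: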

public List<String> getRandomWordsFromDictionary(int number) { 
    List<String> randomWords = new ArrayList<String>(); 
    File file = new File("file.txt"); 
    try { 
     RandomAccessFile raf = new RandomAccessFile(file, "r"); 

     for(int i = 0; i < number; i++) { 
      Random random = new Random(); 
      int startPosition; 
      String word; 
      do { 
       startPosition = random.nextInt((int)raf.length()); 
       raf.seek(startPosition); 
       raf.readLine(); 
       word = grabWordFromDictionary(raf); 
      } while(checkProbability(word)); 
      System.out.println("Word: " + word); 
      randomWords.add(word); 
     } 
    } catch (IOException ioe) { 
     logger.error(ioe.getMessage(), ioe); 
    } 
    return randomWords; 
} 

private String grabWordFromDictionary(RandomAccessFile raf) throws IOException { 
    byte[] wordInBytes = new byte[15]; 
    int counter = 0; 
    byte wordByte; 
    char wordChar; 
    String convertedWord; 
    boolean stop = true; 
    do { 
     wordByte = raf.readByte(); 
     wordChar = (char)wordByte; 
     if(wordChar == '\n' || wordChar == '\r' || wordChar == -1) { 
      stop = false; 
     } else { 
      wordInBytes[counter] = wordByte; 
      counter++; 
     }   
    } while(stop); 
    if(wordInBytes.length > 0) { 
     convertedWord = new String(wordInBytes, "UTF8"); 
     return convertedWord; 
    } else { 
     return null; 
    } 
} 

private boolean checkProbability(String word) { 
    if(word.length() > MAX_LENGTH_LINE) { 
     return true; 
    } else { 
     double randomDouble = new Random().nextDouble(); 
     double probability = (double) MIN_LENGTH_LINE/word.length(); 
     return probability <= randomDouble;   
    } 
}

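But something is still wrong. Could you look at this code and help me? Maybe you can see an obvious mistake that is not obvious to me. I would appreciate any help.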

Source

2012-12-13 Mariusz Grodek

Your file is encoded in 1250, so you need to decode it as 1250, not as UTF-8. Once it is decoded, you can still save it as UTF-8.

Charset w1250 = Charset.forName("Windows-1250"); 
convertedWord = new String(wordInBytes, w1250);
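To illustrate why the charset passed to `new String` matters, here is a minimal sketch. The class name `Cp1250Demo` and its helper methods are made up for this example, and it assumes the JVM ships the `windows-1250` charset (standard JDK builds do). In Windows-1250 the letter ą is the single byte 0xB9, which is not a valid UTF-8 sequence on its own, so decoding it as UTF-8 yields a replacement character.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Cp1250Demo {
    // Assumption: the running JVM provides the windows-1250 charset.
    static final Charset WINDOWS_1250 = Charset.forName("windows-1250");

    // Decode raw file bytes using the encoding the file was written in.
    static String decode(byte[] raw) {
        return new String(raw, WINDOWS_1250);
    }

    // Re-encode an already decoded string as UTF-8 bytes (e.g. for saving).
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // The word "zając" as it is stored in a Windows-1250 file.
        byte[] raw = {'z', 'a', 'j', (byte) 0xB9, 'c'};

        String wrong = new String(raw, StandardCharsets.UTF_8); // 0xB9 is malformed here
        String right = decode(raw);

        // Print through a UTF-8 stream so the console output is UTF-8 as well.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("decoded as UTF-8:        " + wrong);
        out.println("decoded as windows-1250: " + right);
        out.println("UTF-8 byte count:        " + toUtf8(right).length);
    }
}
```

A Java `String` has no file encoding of its own once it is decoded; "needing UTF-8" only applies at the point where the text is written out again as bytes.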

Source

2012-12-13 22:04:39 Esailija

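But I need these words in UTF8. Is there a way to convert them to UTF8? Or did I misunderstand you?

@MariuszGrodek What makes you think that? Yes, you need to decode it as 1250, because it was encoded in 1250. After that, you can encode it as UTF-8. Read the file normally with your original code, but this time use the w1250 charset instead of UTF-8. – Esailija

Sorry, I checked your code and you are absolutely right! I just misunderstood the problem. Thank you very much for clarifying it for me.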
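The approach from the answer and comments could be sketched roughly as follows. This is not the asker's final code: the class `RandomLineReader` and its methods are hypothetical names, and the sketch again assumes the `windows-1250` charset is available. It collects bytes into a growable buffer rather than a fixed 15-byte array (which would leave trailing NUL characters in the decoded word), and it uses `read()`, which returns -1 at end of file, where `readByte()` would throw an `EOFException`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.Charset;

class RandomLineReader {
    // Assumption: the dictionary file is Windows-1250 encoded, as stated in the question.
    static final Charset WINDOWS_1250 = Charset.forName("windows-1250");

    // Reads bytes from the current position up to the next '\n' (or end of file)
    // and decodes them as windows-1250. Returns null if already at end of file.
    static String readLine(RandomAccessFile raf) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        boolean readAnything = false;
        int b;
        while ((b = raf.read()) != -1) {
            readAnything = true;
            if (b == '\n') {
                break;
            }
            if (b != '\r') { // drop the CR of a CRLF line ending
                buffer.write(b);
            }
        }
        return readAnything ? new String(buffer.toByteArray(), WINDOWS_1250) : null;
    }

    // Seeks to the given offset, discards the (probably partial) line found there,
    // and returns the next complete line, or null if the file ends first.
    static String lineAfter(RandomAccessFile raf, long offset) throws IOException {
        raf.seek(offset);
        readLine(raf);
        return readLine(raf);
    }
}
```

A caller would pick a random `offset` below `raf.length()`, retry when `lineAfter` returns null, and keep the results in a `Set<String>` to get unique words. As in the original loop, the first line of the file can never be selected this way, because the line at the seek position is always discarded.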
