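Huffman encoding - handling Unicode

I implemented Huffman encoding in Java, and it works on byte data read from an input file. However, it only works for compressing ASCII. I would like to extend it so that it can handle characters larger than 1 byte, but I'm not sure exactly how to do that.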

private static final int CHARS = 256;

private int[] getByteFrequency(File f) throws FileNotFoundException {
    try {
        FileInputStream fis = new FileInputStream(f);
        byte[] bb = new byte[(int) f.length()];
        int[] aa = new int[CHARS];
        if (fis.read(bb) == bb.length) {
            System.out.print("Uncompressed data: ");
            for (int i = 0; i < bb.length; i++) {
                System.out.print((char) bb[i]);
                aa[bb[i]]++; // the ArrayIndexOutOfBoundsException is thrown here
            }
            System.out.println();
        }
        return aa;
    } catch (FileNotFoundException e) {
        throw new FileNotFoundException();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}

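For example, this is what I use to get the frequency of the characters in a file, and it obviously only works on single bytes. If I give it a Unicode file, I get an ArrayIndexOutOfBoundsException at aa[bb[i]]++;, and the index is usually a negative number. I know this is because aa[bb[i]]++; only looks at one byte, and a Unicode character will be more than one byte, but I don't know how to change it.

Can anyone give me some pointers?

Source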

2012-11-01 Kumalh

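Why treat it as Unicode rather than as a byte array? –

@JeffFerland: If you look at his code, he is using it as a byte array; he just fell into the "signed byte" pit. – DThought

Try the following: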

private static final int CHARS = 256;

private int[] getByteFrequency(File f) throws FileNotFoundException {
    try {
        FileInputStream fis = new FileInputStream(f);
        byte[] bb = new byte[(int) f.length()];
        int[] aa = new int[CHARS];
        if (fis.read(bb) == bb.length) {
            System.out.print("Uncompressed data: ");
            for (int i = 0; i < bb.length; i++) {
                System.out.print((char) bb[i]);
                aa[((int) bb[i]) & 0xff]++; // mask to an unsigned index in 0..255
            }
            System.out.println();
        }
        return aa;
    } catch (FileNotFoundException e) {
        throw new FileNotFoundException();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}

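If I'm right (I haven't tested it), your problem is that bytes are signed values in Java. Converting the byte to an int and masking it with 0xff should handle it correctly.

Source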
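
To make the signed-byte point concrete, here is a minimal sketch of what the mask does to a single byte. The class name SignedByteDemo and the helper toUnsigned are made up for illustration and are not part of the original code; 0xC4 is used because it is the first UTF-8 byte of 'ā'.

```java
// Hypothetical demo class; the names here are assumptions, not from the original post.
final class SignedByteDemo {

    /** Maps a signed Java byte (-128..127) to its unsigned value (0..255). */
    static int toUnsigned(byte b) {
        // b is widened to int (with sign extension), then the upper 24 bits are cleared.
        return b & 0xff;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xC4;                       // first UTF-8 byte of 'ā'
        System.out.println(b);                      // -60, which is not a valid array index
        System.out.println(toUnsigned(b));          // 196
        System.out.println(Byte.toUnsignedInt(b));  // 196, the same conversion (Java 8+)
    }
}
```

Any byte of 0x80 or above is negative when read as a Java byte, which is why the unmasked aa[bb[i]]++ fails only on non-ASCII input.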

2012-11-01 09:42:05 DThought

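Thanks! After adding that I no longer get the negative value problem. However, there seems to be a problem with reading the characters. This may be down to my lack of understanding of character encodings, but when I give it a file containing 'āčôęłüß', it interprets it as 'ㅑチㅑヘㅐ' ㅄㅑルㅒツㅐㅌㅐ﾿ㅐ゚」. Then, when I finish the character-to-code mapping, it comes up with a bunch of other symbols (e.g. ¼, ¿, etc.), none of which are in the input file. – Kumalh

@Kumalh Well, you are processing the file as a byte array, so the encoding should be irrelevant. Maybe there are other places where you need the same conversion to handle byte values correctly? By the way, you could optimize the code so that it does not read the complete file into memory, and instead processes it in chunks to build the frequency map. – DThought

Could it be that some characters need more than one byte, and my program misinterprets them because it only reads one byte at a time? As for the optimization, I may look at that at a later stage, because right now I'm more concerned with getting it working! Thanks for your thoughts; I'll look into it. – Kumalh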
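
The chunked approach suggested in this comment could look roughly like the sketch below. It is only an illustration under assumptions: the class name ByteFrequencyCounter, the method names, and the 8192-byte buffer size are invented here and do not come from the original code.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper; names and buffer size are assumptions made for this sketch.
final class ByteFrequencyCounter {
    private static final int SYMBOLS = 256; // one counter per possible byte value

    /** Counts how often each byte value occurs, reading the stream in chunks. */
    static int[] countFrequencies(InputStream in) throws IOException {
        int[] freq = new int[SYMBOLS];
        byte[] buffer = new byte[8192];
        int n;
        // read() returns the number of bytes actually read, or -1 at end of stream.
        while ((n = in.read(buffer)) != -1) {
            for (int i = 0; i < n; i++) {
                freq[buffer[i] & 0xff]++; // mask so bytes >= 0x80 index 128..255
            }
        }
        return freq;
    }

    /** Convenience wrapper that opens the file and always closes it afterwards. */
    static int[] countFile(String path) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            return countFrequencies(in);
        }
    }
}
```

Looping on the return value of read also avoids relying on a single read call filling the whole array, which InputStream does not guarantee.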
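
On the multi-byte question: assuming the file is saved as UTF-8, each character of 'āčôęłüß' is stored as two bytes, so casting every byte to char on its own cannot reproduce the text. The sketch below illustrates this; the class Utf8BytesDemo and its method names are hypothetical, and the UTF-8 assumption is mine, since the thread does not say which encoding the file uses.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are assumptions, and UTF-8 input is assumed.
final class Utf8BytesDemo {
    // "āčôęłüß" written with Unicode escapes so the source file encoding does not matter.
    static final String SAMPLE = "\u0101\u010D\u00F4\u0119\u0142\u00FC\u00DF";

    /** Casts each byte to a char individually; bytes >= 0x80 are sign-extended to U+FFxx. */
    static String castEachByte(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    /** Decodes the whole byte sequence as UTF-8, combining multi-byte sequences. */
    static String decodeUtf8(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = SAMPLE.getBytes(StandardCharsets.UTF_8);
        System.out.println(SAMPLE.length() + " characters, " + data.length + " bytes");
        System.out.println(decodeUtf8(data).equals(SAMPLE));   // true
        System.out.println(castEachByte(data).equals(SAMPLE)); // false
    }
}
```

A byte-level Huffman coder does not need to know about characters at all: if the decoder writes back exactly the same bytes, the file is restored whatever its encoding. Decoding to a String only matters when printing the data for inspection.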
