2011-05-14 33 views
1

您好,我在優化burrows wheeler transform時遇到了一些困難。我試圖轉換文本文件,但是轉換像聖經這樣的大型文本文件太長了。Burrows Wheeler變換的優化

關於如何進行的任何想法?

public BurrowsWheelerTransformEncoder() 
{ 

} 

private String originalSuffix(int index, String string) 
{ 
    String temp = (string.substring(index,string.length()) + string.substring(0,index)); 

    //this bit just 'compresses' each transformation of text by producing 
    //a prefix, so 'abracadabra' just becomes 'abrac' 
    //this is so minimal amount of memory is used when it is stored in an array 

    return temp.substring(0,5)+ 
    //the last character of the transformation is kept 
      temp.charAt(temp.length()-1); 
} 

private String compressedSuffix(String string) 
{ 
    //this method just 'compresses' original piece of text by producing 
    //a prefix, so 'abracadabra' just becomes 'abrac' 
    //this is so comprisons won't take so long 
    return string.substring(0,5)+string.charAt(string.length()-1); 
} 

public static void main(String args[]) throws Exception 
{ 
    BurrowsWheelerTransformEncoder encoder = new BurrowsWheelerTransformEncoder(); 
    BufferedReader input = new BufferedReader(new FileReader("src/compressionalgorithm/texts/manifesto.txt")); 

    String text = ""; 
    //the row in the sorted array where the original text can be found 
    int originalRow = 0; 
    //system time when program began 
    long startTime = System.nanoTime(); 

    //get text from file 
    while(input.ready()) 
    { 
     text += input.readLine(); 
    } 
    //create a new array to hold all transformations 
    String[] textArray = new String[text.length()]; 
    int length = text.length(); 

    //get individual transformations and put in array 
    for(int i = 0; i < text.length(); i++) 
    { 
     textArray[i] = encoder.originalSuffix(i,text); 
     //for debugging large text files, prints progress after every 10k'th 
     //transformation 
     if(i%10000==0) 
     System.out.println(i+"/"+length); 
    } 
    //uses java's internal methods to sort the array, presumably 
    //the most efficient way to do the sort (for now) 
    Arrays.sort(textArray); 

    String compressedOriginalText = encoder.compressedSuffix(text); 

    //print the results 
    for(int i = 0; i < textArray.length; i++) 
    { 
     if(textArray[i].equals(compressedOriginalText)) 
     { 
      originalRow = i; 
     } 
     if(i%100==0) 
     { 
      System.out.println(); 
     } 
     System.out.print(textArray[i].charAt(textArray[i].length()-1)); 
    } 
    System.out.println("\nThe original transformation of the text was found at row " + originalRow + " of the sorted array."); 
    System.out.println("Time elapsed: " + (System.nanoTime() - startTime)); 
} 

回答

1

這條線:

String temp = (string.substring(index,string.length()) + string.substring(0,index)); 

是要在每次調用時創建整個輸入文本的副本。由於您對N個字符的輸入文本進行了N次調用,因此您的算法將爲O(N^2)

看看你是否可以優化originalSuffix方法來避免複製。

+0

複製是必要的,以便創建的轉換的有序數組 – nope 2011-05-14 07:38:43

+0

不,它不是。或者如果是這樣,那麼您的方法實施就會失敗。該方法創建並返回一個長度爲6個字符的字符串,但正在複製整個輸入字符串。 – 2011-05-14 09:32:23

3

對於編碼案例,您不需要實際構建字符串數組 - 使用int(或long取決於文件大小)數組來存儲旋轉字符串開始的索引。

  • 創建初始化爲[0 1 2 3 ... N]
  • 用下面的compareTo陣列(假定compareTo()訪問原始字符串,original)分類的數組:

    int compareTo(int a, int b){ 
        int compare, len = original.length(); 
        do{ 
         char _a = original.charAt(a), _b = original.charAt(b); 
         compare = _a-_b; 
         a++; b++; 
         if(a>=len)a-=len; 
         if(b>=len)b-=len; 
        }while(compare==0); 
        return compare; 
    } 
    
  • 記爲「0」的數組中的索引,並添加到您的輸出爲「開始」值

對於逆轉,我們再次希望避免爲與聖經一樣大的文本構建整個表格。我們可以通過使用第一行和最後一行中相同的標記始終處於相同順序這一事實來做到這一點。這是真實的,因爲第一行是排序的,令牌是循環排列的:對於最後一行中的三個連續的b,它們之後的令牌被排序,所以b被排序。如此反轉:

  • 排序輸出令牌。隨着存儲排序的令牌,存儲每個令牌開始的索引。因此,對於未分類的令牌「nbnaaa」,您可以存儲[3 4 5 2 0 1]和「aaabnn」。 重要:這一步你必須使用穩定的排序。
  • 使用前面提到的重建串「開始」值:

    string decode(string sorted, int[]index, int start){ 
        string answer = ""+sorted.charAt(start); 
        int next = index[start]; 
        while(next!=start){ 
         answer = sorted.charAt(next) + answer; 
         next = index[next]; 
        } 
        return answer; 
    }