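Selecting random lines from a huge text file

I have a very large text file (18 million lines, 4 GB) and I want to pick some random lines from it. I wrote the following code to do this, but it is slow: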

import java.io.BufferedWriter; 
import java.io.IOException; 
import java.nio.charset.Charset; 
import java.nio.file.Files; 
import java.nio.file.Paths; 
import java.util.Arrays; 
import java.util.Collections; 
import java.util.List; 
import java.util.Random; 
import java.util.stream.Collectors; 
import java.util.stream.Stream; 
public class Main { 

    public static void main(String[] args) throws IOException { 
     int sampleSize =3000; 
     int fileSize = 18000000; 
     int[] linesNumber = new int[sampleSize]; 
     Random r = new Random(); 
     for (int i = 0; i < linesNumber.length; i++) { 
      linesNumber[i] = r.nextInt(fileSize); 

     } 
     List<Integer> list = Arrays.stream(linesNumber).boxed().collect(Collectors.toList()); 
     Collections.sort(list); 

     BufferedWriter outputWriter = Files.newBufferedWriter(Paths.get("output.txt")); 

     for (int i : list) { 

      try (Stream<String> lines = Files.lines(Paths.get("huge_text_file"))) { 
       String en = lines.skip(i).findFirst().get(); // skip i lines, then take line i (0-based)

       outputWriter.write(en+"\n"); 
       lines.close(); 

      } catch (Exception e) { 
       System.err.println(e); 

      } 

     } 
     outputWriter.close(); 


    } 
}

Is there a more elegant and faster way to do this? Thanks.

Asked 2017-09-08 by Anas Bari

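This might be a Code Review type of question. I really don't know.

If this code works correctly, then this question is off-topic for Stack Overflow, but it may be a good fit for our sister site Code Review (https://codereview.stackexchange.com/).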

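There are a couple of things that trouble me about your current code.

You are loading the entire file into RAM. I don't know much about your sample file, but the file I used crashed my default JVM.
You are skipping over the same lines again and again, especially the early ones. This is very inefficient: every sampled line re-reads the file from the start, so the work grows roughly as sample size times file length. I would be surprised if you could process even a 500 MB file this way.

Here is what I came up with: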

// In addition to the imports in the question: java.io.BufferedReader, java.io.FileReader
public static void main(String[] args) throws IOException { 
    int sampleSize = 3000; 
    int fileSize = 50000; 
    int[] linesNumber = new int[sampleSize]; 
    Random r = new Random(); 
    for (int i = 0; i < linesNumber.length; i++) { 
     linesNumber[i] = r.nextInt(fileSize); 

    } 
    List<Integer> list = Arrays.stream(linesNumber).boxed().collect(Collectors.toList()); 
    Collections.sort(list); 

    BufferedWriter outputWriter = Files.newBufferedWriter(Paths.get("localOutput/output.txt")); 
    long t1 = System.currentTimeMillis(); 
    try(BufferedReader reader = new BufferedReader(new FileReader("extremely large file.txt"))) 
    { 
     int index = 0;//keep track of what item we're on in the list 
     int currentIndex = 0;//keep track of what line we're on in the input file 
     while(index < sampleSize)//while we still haven't finished the list 
     { 
      if(currentIndex == list.get(index))//if we reach a line 
      { 
       outputWriter.write(reader.readLine()); 
       outputWriter.write("\n");//readLine doesn't include the newline characters 
       while(index < sampleSize && list.get(index) <= currentIndex)//have to put this here in case of duplicates in the list 
        index++; 
      } 
      else 
       reader.readLine();//readLine is dang fast. There may be faster ways to skip a line, but this is still plenty fast. 
      currentIndex++; 
     } 
    } catch (Exception e) { 
     System.err.println(e); 
    } 
    outputWriter.close(); 
    System.out.println(String.format("Took %d milliseconds", System.currentTimeMillis() - t1)); 
}

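This took about 87 milliseconds with a sample size of 30 and a fileSize of 50000, run against a 4.7 GB file, and about 91 milliseconds when I changed the sample size to 3000. When I increased the file size to 10,000, it took 122 milliseconds. TL;DR for this paragraph: it scales quite well, and it scales especially well with larger sample sizes.

To answer your question directly ("Is there a more elegant and faster way to do this?"): yes, there is. The faster way is to skip lines yourself, avoid loading the whole file into memory, and keep using buffered readers and writers. Also, I would avoid trying to roll your own raw array buffer or anything like that. Just don't.

If you want to know more about how it works, feel free to read through the method I have included.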
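One detail worth illustrating: the loop above skips duplicate indices, so it can write fewer than sampleSize lines. Below is a minimal sketch of the same single-pass idea that first draws distinct indices. The class name SinglePassSampler and both method names are hypothetical, UTF-8 input is assumed, and the index-drawing loop is only sensible when sampleSize is much smaller than the line count.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.Random;
import java.util.TreeSet;

// Hypothetical helper class; names and signatures are illustrative only.
class SinglePassSampler {

    /** Draws sampleSize distinct 0-based line indices in [0, lineCount), sorted ascending. */
    static TreeSet<Integer> pickIndices(int sampleSize, int lineCount, Random random) {
        if (sampleSize > lineCount) {
            throw new IllegalArgumentException("sampleSize exceeds lineCount");
        }
        TreeSet<Integer> indices = new TreeSet<>();
        while (indices.size() < sampleSize) {
            indices.add(random.nextInt(lineCount)); // the set ignores repeated values
        }
        return indices;
    }

    /** Copies the lines at the given indices in one sequential pass; returns the number written. */
    static int copyLines(Path input, Path output, TreeSet<Integer> indices) throws IOException {
        int written = 0;
        // Assumption: the file is UTF-8; both resources are closed automatically.
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            Iterator<Integer> wanted = indices.iterator();
            if (!wanted.hasNext()) {
                return 0;
            }
            int next = wanted.next();
            int current = 0; // 0-based index of the line just read
            String line;
            while ((line = reader.readLine()) != null) {
                if (current == next) {
                    writer.write(line);
                    writer.newLine();
                    written++;
                    if (!wanted.hasNext()) {
                        break; // stop reading once the last wanted line is written
                    }
                    next = wanted.next();
                }
                current++;
            }
        }
        return written;
    }
}
```

A call such as SinglePassSampler.copyLines(Paths.get("huge_text_file"), Paths.get("output.txt"), SinglePassSampler.pickIndices(3000, 18000000, new Random())) would mirror the numbers in the question. If the file has fewer lines than lineCount, the method simply writes fewer lines.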

Answered 2017-09-08 22:39:53 by Jeutnarg

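My first thought would be to look at Java's RandomAccessFile (see https://docs.oracle.com/javase/tutorial/essential/io/rafs.html). A random seek is usually much faster than reading the whole file, but you would need to read byte by byte to reach the start of the next line (for example), then read byte by byte up to the following newline to get that line, and then seek to another random position.

I'm not sure this approach would be more elegant (I guess that partly depends on how you code it), but I would expect it to be faster.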
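A minimal sketch of that seek-and-read idea, under some assumptions: the class name SeekSampler and the method randomLine are hypothetical, the file is assumed to be UTF-8, and wrapping around to the first line at end of file is one choice among several. Note that this does not sample lines uniformly: a line is chosen whenever the random offset falls inside the line before it, so lines that follow long lines are picked more often.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Random;

// Hypothetical helper class; names are illustrative only.
class SeekSampler {

    /**
     * Seeks to a random byte offset, discards the rest of the line it landed in,
     * and returns the following line (wrapping to the first line at end of file).
     */
    static String randomLine(RandomAccessFile file, Random random) throws IOException {
        long length = file.length();
        if (length == 0) {
            throw new IllegalArgumentException("empty file");
        }
        long offset = (long) (random.nextDouble() * length); // in [0, length)
        file.seek(offset);
        file.readLine();              // discard the (probably partial) current line
        String raw = file.readLine(); // the next complete line, or null at end of file
        if (raw == null) {
            file.seek(0);             // landed in the last line: wrap to the first one
            raw = file.readLine();
        }
        // RandomAccessFile.readLine maps each byte to one char, so recover the
        // original bytes and decode them as UTF-8 (assumed encoding).
        return new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }
}
```

Typical use would be to open the file once with new RandomAccessFile("huge_text_file", "r") and call randomLine in a loop. The same line can come back more than once, so callers who need distinct lines would have to deduplicate.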

Answered 2017-09-08 22:07:05

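There is no efficient way to seek to a line. The only thing I can think of is to use RandomAccessFile, seek to a random position, and read the next 200 (?) characters into an array. Then search for the line breaks and form a string.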
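As a rough sketch of that chunk-based variant: the class name ChunkSampler, the method randomLine, and the chunkSize parameter are hypothetical, and UTF-8 text is assumed. The method returns null when the chunk does not contain a complete line, in which case the caller can retry or pass a larger chunk. Like any offset-based approach, it favors lines that follow long lines, and it never returns the first line of the file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Random;

// Hypothetical helper class; names are illustrative only.
class ChunkSampler {

    /**
     * Reads up to chunkSize bytes from a random offset and returns the first
     * complete line inside that chunk, or null if the chunk holds fewer than
     * two line breaks.
     */
    static String randomLine(RandomAccessFile file, int chunkSize, Random random) throws IOException {
        long length = file.length();
        if (length == 0) {
            return null;
        }
        file.seek((long) (random.nextDouble() * length));
        byte[] chunk = new byte[chunkSize];
        int filled = 0;
        // read() may return fewer bytes than requested, so loop until full or end of file
        while (filled < chunkSize) {
            int n = file.read(chunk, filled, chunkSize - filled);
            if (n < 0) {
                break;
            }
            filled += n;
        }
        int start = indexOf(chunk, 0, filled, (byte) '\n');       // end of the partial line
        if (start < 0) {
            return null;
        }
        int end = indexOf(chunk, start + 1, filled, (byte) '\n'); // end of the complete line
        if (end < 0) {
            return null;
        }
        // Drop a trailing '\r' so Windows line endings are handled too.
        int lineEnd = (end > start + 1 && chunk[end - 1] == '\r') ? end - 1 : end;
        return new String(chunk, start + 1, lineEnd - (start + 1), StandardCharsets.UTF_8);
    }

    /** Returns the first index in [from, to) holding target, or -1 if there is none. */
    private static int indexOf(byte[] data, int from, int to, byte target) {
        for (int i = from; i < to; i++) {
            if (data[i] == target) {
                return i;
            }
        }
        return -1;
    }
}
```

Splitting on the newline byte is safe for UTF-8 because that byte never occurs inside a multi-byte sequence. A chunk size of 200 only works if lines are well under 100 bytes, so in practice the size would be chosen from the longest expected line.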


Answered 2017-09-08 22:21:26
