Apache PDFBox - getting java.lang.OutOfMemoryError when using split(PDDocument document)

I am trying to split a decent-sized 300-page document using the Apache PDFBox API V2.0.2. I tried to split the PDF file into single pages with the following code:

    PDDocument document = PDDocument.load(inputFile);
    Splitter splitter = new Splitter();
    List<PDDocument> splittedDocuments = splitter.split(document); // Exception happens here

I receive the following exception:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded

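This indicates that the GC is spending a lot of time clearing the heap, which is not justified by the amount it reclaims.

There are numerous JVM tuning methods that can work around this situation; however, all of them treat the symptom and not the real problem.

One final note: I am using JDK 6, so the new Java 8 Consumer is not an option in my case. Thanks.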

Edit:

This is not a duplicate of http://stackoverflow.com/questions/37771252/splitting-a-pdf-results-in-very-large-pdf-documents-with-pdfbox-2-0-2 because:

 
1. I do not have the size problem mentioned in the aforementioned
    topic. I am slicing a 270-page, 13.8 MB PDF file, and after slicing,
    each slice averages 80 KB, with a total size of 30.7 MB.
2. The split throws the exception even before it returns the split parts.

I found that split can get the job done as long as I do not pass it the whole file, but instead pass it in "batches" of 20-30 pages each.


2016-07-04 WiredCoder

Known bug; use 2.0.1 until this is fixed.

Have you tried the previous version, as Tilman suggested?

I have a restriction on the version number @GeorgeGarchagudashvili – WiredCoder

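PDFBox stores the parts produced by the split operation on the heap as objects of type PDDocument, so the heap fills up quickly. Even if you call close() after every round of the loop, the GC still cannot reclaim heap space as fast as it is being filled.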
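To see whether the heap really keeps growing during a split, it can help to log heap usage between rounds. The following is a minimal sketch that uses only the standard `Runtime` API and works on JDK 6. The class and method names (`HeapProbe`, `usedHeapMb`, `maxHeapMb`) are hypothetical and not part of PDFBox, and the figures are approximate because the JVM reports them without forcing a collection.

```java
class HeapProbe {

    /** Approximate heap currently in use, in megabytes. */
    static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024L * 1024L);
    }

    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Call this before and after each round of splitting to compare.
        System.out.println("used " + usedHeapMb() + " MB of max " + maxHeapMb() + " MB");
    }
}
```

If the used figure climbs steadily toward the maximum while pages are being split, that is consistent with the split documents being retained on the heap.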

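One option when staying on 2.0.2 is to break the split operation into batches, where each batch is a relatively manageable chunk (10 to 40 pages):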

import java.io.File;
import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;

public void execute() {
    File inputFile = new File("path/to/the/file.pdf");
    PDDocument document = null;
    try {
        document = PDDocument.load(inputFile);

        int batchSize = 50;
        int totalPages = document.getNumberOfPages();
        int batch = 1;
        // Splitter page numbers are 1-based and inclusive at both ends,
        // so each batch ends at start + batchSize - 1 to avoid overlap.
        for (int start = 1; start <= totalPages; start += batchSize, batch++) {
            int end = Math.min(start + batchSize - 1, totalPages);
            System.out.println("Batch: " + batch + " start: " + start + " end: " + end);
            split(document, start, end);
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // close the source document only after every batch has been saved
        if (document != null) {
            try {
                document.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

private void split(PDDocument document, int start, int end) throws IOException {
    Splitter splitter = new Splitter();
    splitter.setStartPage(start);
    splitter.setEndPage(end);
    List<PDDocument> splittedDocuments = splitter.split(document);
    // Config is the poster's own configuration holder, not a PDFBox class
    String outputPath = Config.INSTANCE.getProperty("outputPath");
    String title = document.getDocumentInformation().getTitle();

    for (int index = 0; index < splittedDocuments.size(); index++) {
        PDDocument splittedDocument = splittedDocuments.get(index);
        try {
            // name each file by its absolute page number so batches never collide
            splittedDocument.save(new File(outputPath, title + "-" + (start + index) + ".pdf"));
        } finally {
            // release the page document as soon as it is written
            splittedDocument.close();
        }
    }
}
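Because `Splitter.setStartPage` and `Splitter.setEndPage` take 1-based page numbers that are inclusive at both ends, the batch boundaries are easy to get off by one. The sketch below isolates just the range arithmetic so it can be checked without PDFBox. `BatchRanges` and `ranges` are hypothetical names introduced for illustration, and the code sticks to JDK 6 syntax.

```java
import java.util.ArrayList;
import java.util.List;

class BatchRanges {

    /** Returns inclusive, 1-based {start, end} page ranges covering 1..totalPages. */
    static List<int[]> ranges(int totalPages, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<int[]> result = new ArrayList<int[]>();
        for (int start = 1; start <= totalPages; start += batchSize) {
            // the last batch is clipped to the final page of the document
            int end = Math.min(start + batchSize - 1, totalPages);
            result.add(new int[] { start, end });
        }
        return result;
    }

    public static void main(String[] args) {
        // 270 pages in batches of 50, as in the question
        for (int[] r : ranges(270, 50)) {
            System.out.println(r[0] + "-" + r[1]);
        }
    }
}
```

For a 270-page document and a batch size of 50 this yields six ranges, 1-50 through 251-270, with no page appearing in two batches and no empty trailing batch when the page count is an exact multiple of the batch size.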


2016-07-10 17:23:28 WiredCoder
