Extracting text from a PDF with PDFBox 2.0

I am trying to extract text with PDFBox 2.0. I want to get the font size of specific characters and the position rectangle of each character on the page. I implemented this in PDFBox 1.6 using PDFTextStripper:

PDFParser parser = new PDFParser(is);
try {
    parser.parse();
} catch (IOException e) {
    // parse errors are ignored here
}
COSDocument cosDoc = parser.getDocument();
PDDocument pdd = new PDDocument(cosDoc);
final StringBuffer extractedText = new StringBuffer();
PDFTextStripper textStripper = new PDFTextStripper() {
    @Override
    protected void processTextPosition(TextPosition text) {
        // called for each TextPosition found on the page
        extractedText.append(text.getCharacter());
        logger.debug("text position: " + text.toString());
    }
};
textStripper.setSuppressDuplicateOverlappingText(false);
for (int pageNum = 0; pageNum < pdd.getNumberOfPages(); pageNum++) {
    PDPage page = (PDPage) pdd.getDocumentCatalog().getAllPages().get(pageNum);
    // PDFBox 1.x API: feed the page's content stream to the stripper directly
    textStripper.processStream(page, page.findResources(), page.getContents().getStream());
}
pdd.close();

However, the processStream method has been removed in PDFBox 2.0. How can I achieve the same thing with PDFBox 2.0?

I have tried the following:

PDDocument pdd = PDDocument.load(inputStream);
PDFTextStripper textStripper = new PDFTextStripper() {
    @Override
    protected void processTextPosition(TextPosition text) {
        // PDFdocument and Range are defined elsewhere in the asker's code (not shown)
        int pos = PDFdocument.length();
        String textadded = text.getUnicode();
        Range range = new Range(pos, pos + textadded.length());
        int pagenr = this.getCurrentPageNo();
        Rectangle2D rect = new Rectangle2D.Float(text.getX(), text.getY(), text.getWidth(), text.getHeight());
    }
};
textStripper.setSuppressDuplicateOverlappingText(false);
for (int pageNum = 0; pageNum < pdd.getNumberOfPages(); pageNum++) {
    PDPage page = (PDPage) pdd.getDocumentCatalog().getPages().get(pageNum);
    textStripper.processPage(page);
}
pdd.close();

The processTextPosition(TextPosition text) method is never called. Any suggestions would be very welcome.

Source

2016-02-29 Dieudonné

Please look at the DrawPrintTextLocations example in the source code; it does what you apparently want to do. It uses the writeString() call.

Thanks, that example is exactly what I was looking for.

The DrawPrintTextLocations example suggested by @tilmanhausherr provided the solution to my problem.

The parser is started with the following code (inputStream is an input stream opened from the URL of the PDF file):

PDDocument pdd = null;
try {
    pdd = PDDocument.load(inputStream);
    PDFParserTextStripper stripper = new PDFParserTextStripper(pdd);
    stripper.setSortByPosition(true);
    for (int i = 0; i < pdd.getNumberOfPages(); i++) {
        stripper.stripPage(i);
    }
} catch (IOException e) {
    // throw error
} finally {
    if (pdd != null) {
        try {
            pdd.close();
        } catch (IOException e) {
            // nothing useful can be done if closing fails
        }
    }
}

This code uses a custom subclass of PDFTextStripper:

class PDFParserTextStripper extends PDFTextStripper {

    public PDFParserTextStripper(PDDocument pdd) throws IOException {
        super();
        document = pdd; // protected field of PDFTextStripper, used by stripPage below
    }

    public void stripPage(int pageNr) throws IOException {
        // PDFTextStripper page numbers are 1-based
        this.setStartPage(pageNr + 1);
        this.setEndPage(pageNr + 1);
        Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
        writeText(document, dummy); // This call starts the parsing process and calls writeString repeatedly.
    }

    @Override
    protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
        // one TextPosition per character: print its position, font size and dimensions
        for (TextPosition text : textPositions) {
            System.out.println("String[" + text.getXDirAdj() + "," + text.getYDirAdj()
                    + " fs=" + text.getFontSizeInPt()
                    + " xscale=" + text.getXScale()
                    + " height=" + text.getHeightDir()
                    + " space=" + text.getWidthOfSpace()
                    + " width=" + text.getWidthDirAdj()
                    + " ] " + text.getUnicode());
        }
    }

}
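The question asked for a position rectangle per character, while writeString above only prints the raw values. A minimal sketch of turning those values into rectangles follows. It uses only the JDK so it can run without PDFBox: GlyphBounds and Glyph are hypothetical names, and the Glyph fields stand in for getXDirAdj(), getYDirAdj(), getWidthDirAdj(), getHeightDir() and getFontSizeInPt(). It assumes the y value is the glyph baseline measured from the top of the page, which should be checked against the PDFBox documentation.

```java
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of PDFBox. */
final class GlyphBounds {

    /** Stand-in for the per-character values printed in writeString. */
    static final class Glyph {
        final String unicode;
        final float x;          // left edge, as from getXDirAdj()
        final float baselineY;  // assumed baseline from page top, as from getYDirAdj()
        final float width;      // as from getWidthDirAdj()
        final float height;     // as from getHeightDir()
        final float fontSizePt; // as from getFontSizeInPt()

        Glyph(String unicode, float x, float baselineY, float width, float height, float fontSizePt) {
            this.unicode = unicode;
            this.x = x;
            this.baselineY = baselineY;
            this.width = width;
            this.height = height;
            this.fontSizePt = fontSizePt;
        }
    }

    /** Rectangle of one glyph: the top edge is the baseline minus the glyph height. */
    static Rectangle2D.Float boundsOf(Glyph g) {
        return new Rectangle2D.Float(g.x, g.baselineY - g.height, g.width, g.height);
    }

    /** Smallest rectangle enclosing all glyphs, or null for an empty list. */
    static Rectangle2D unionOf(List<Glyph> glyphs) {
        Rectangle2D result = null;
        for (Glyph g : glyphs) {
            Rectangle2D.Float r = boundsOf(g);
            result = (result == null) ? r : result.createUnion(r);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Glyph> word = Arrays.asList(
                new Glyph("H", 72f, 100f, 8f, 10f, 12f),
                new Glyph("i", 80f, 100f, 4f, 10f, 12f));
        System.out.println(boundsOf(word.get(0)));
        System.out.println(unionOf(word));
    }
}
```

Inside writeString, a Glyph could be filled from each TextPosition and the union taken over textPositions to get one rectangle per string.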

Source

2016-03-02 09:29:36

This works well, thanks. Why the PDFRenderer & PDPage objects? – Darajan

@Darajan You are right. They are probably leftovers from an earlier attempt. I will remove them from the answer.

@Dieudonné Can you point me to where the "PDFdocument" class is?

Here is an implementation that uses @tilmanhausherr's suggestion:

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

class PDFParserTextStripper extends PDFTextStripper
{
    public PDFParserTextStripper(PDDocument pdd) throws IOException
    {
        super();
        document = pdd; // protected field of PDFTextStripper, used by stripPage below
    }

    public void stripPage(int pageNr) throws IOException
    {
        // PDFTextStripper page numbers are 1-based
        this.setStartPage(pageNr + 1);
        this.setEndPage(pageNr + 1);
        Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
        writeText(document, dummy); // This call starts the parsing process and calls writeString repeatedly.
    }

    @Override
    protected void writeString(String string, List<TextPosition> textPositions) throws IOException
    {
        // one TextPosition per character: print its position, font size and dimensions
        for (TextPosition text : textPositions)
        {
            System.out.println("String[" + text.getXDirAdj() + "," + text.getYDirAdj()
                    + " fs=" + text.getFontSizeInPt()
                    + " xscale=" + text.getXScale()
                    + " height=" + text.getHeightDir()
                    + " space=" + text.getWidthOfSpace()
                    + " width=" + text.getWidthDirAdj()
                    + " ] " + text.getUnicode());
        }
    }

    public static void extractText(InputStream inputStream) throws IOException
    {
        // try-with-resources closes the document even if stripping fails
        try (PDDocument pdd = PDDocument.load(inputStream))
        {
            PDFParserTextStripper stripper = new PDFParserTextStripper(pdd);
            stripper.setSortByPosition(true);
            for (int i = 0; i < pdd.getNumberOfPages(); i++)
            {
                stripper.stripPage(i);
            }
        }
    }

    public static void main(String[] args)
    {
        File f = new File("C:\\PathToYourPDF\\pdfFile.pdf");

        // the input stream is closed automatically when the block exits
        try (FileInputStream fis = new FileInputStream(f))
        {
            extractText(fis);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
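The dummy Writer in stripPage exists only because writeText requires one; the overridden writeString never uses what ends up in it. As a minimal sketch, a Writer that simply discards its input avoids buffering bytes that nobody reads. DiscardingWriter is a hypothetical name, not part of PDFBox, and passing it to writeText in place of the OutputStreamWriter is an assumption that has not been verified here. It is written as a subclass so it also compiles on older Java versions.

```java
import java.io.IOException;
import java.io.Writer;

/** Hypothetical helper: a Writer that throws away everything written to it. */
final class DiscardingWriter extends Writer {

    // number of characters dropped so far, kept only to make the behaviour observable
    private long discarded;

    @Override
    public void write(char[] buf, int off, int len) {
        discarded += len; // nothing is stored
    }

    @Override
    public void flush() {
        // nothing to flush
    }

    @Override
    public void close() {
        // nothing to release
    }

    long discardedChars() {
        return discarded;
    }

    public static void main(String[] args) throws IOException {
        DiscardingWriter w = new DiscardingWriter();
        w.write("hello"); // Writer.write(String) delegates to write(char[], int, int)
        System.out.println(w.discardedChars());
    }
}
```

On Java 11 and later, Writer.nullWriter() from the standard library serves the same purpose.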

Source

2017-05-12 19:19:45 user4332758
