How to extract text from a PDF file with Apache PDFBox

I want to extract text from a given PDF file using Apache PDFBox.

I wrote this code:

PDFTextStripper pdfStripper = null; 
PDDocument pdDoc = null; 
COSDocument cosDoc = null; 
File file = new File(filepath); 

PDFParser parser = new PDFParser(new FileInputStream(file)); 
parser.parse(); 
cosDoc = parser.getDocument(); 
pdfStripper = new PDFTextStripper(); 
pdDoc = new PDDocument(cosDoc); 
pdfStripper.setStartPage(1); 
pdfStripper.setEndPage(5); 
String parsedText = pdfStripper.getText(pdDoc); 
System.out.println(parsedText);

However, I get the following error:

Exception in thread "main" java.lang.NullPointerException 
at org.apache.fontbox.afm.AFMParser.main(AFMParser.java:304)

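I added pdfbox-1.8.5.jar and fontbox-1.8.5.jar to the classpath.

Edit

I added System.out.println("program starts"); at the beginning of the program.

I ran it and got the same error as above, and "program starts" never appeared in the console.

So I think I have a classpath problem or something similar.

Thanks.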

2014-05-22 Benben

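Maybe your PDF file is not entirely valid and makes PDFBox stumble. You may want to provide the PDF for inspection. – mkl

Are you sure you started the correct 'main()' method? The exception looks as if the 'main()' of 'org.apache.fontbox.afm.AFMParser' was started, which is PDFBox code, not your code. – mkl

You're right. I reset the run configuration and now the program runs correctly. Thank you very much, mkl. – Benben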
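For anyone hitting the same symptom (your own first println never shows up), here is a minimal sketch of a sanity check you could run from the same run configuration. It only uses the JDK. The class name ClasspathCheck and its helper methods are made up for illustration; the two PDFBox/FontBox class names are the ones that appear in this thread. It prints which class was actually started, the effective classpath, and whether the libraries can be found.

```java
import java.security.CodeSource;

// Hypothetical helper for illustration; not part of PDFBox.
final class ClasspathCheck {

    /** Returns true if the named class can be located on the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Returns the jar or directory a class was loaded from, or a placeholder if unknown. */
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return source == null ? "unknown" : String.valueOf(source.getLocation());
    }

    public static void main(String[] args) {
        // If this line does not appear, a different main() is being launched.
        System.out.println("program starts in " + ClasspathCheck.class.getName());
        System.out.println("loaded from:      " + locationOf(ClasspathCheck.class));
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
        System.out.println("PDFBox present:   " + isOnClasspath("org.apache.pdfbox.pdmodel.PDDocument"));
        System.out.println("FontBox present:  " + isOnClasspath("org.apache.fontbox.afm.AFMParser"));
    }
}
```

If the first line is missing from the console, the IDE is starting some other class's main(), which is what happened in the question.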

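I executed your code and it works fine. Maybe your problem is related to the file path you passed to File. I put my PDF on the C: drive and hard-coded the file path. Here is my code: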

// Written against PDFBox 1.8.x. PDFBox 2.0.8 requires an
// org.apache.pdfbox.io.RandomAccessRead instead of an InputStream:
// import org.apache.pdfbox.io.RandomAccessFile;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper; // 1.8.x package; 2.0.x uses org.apache.pdfbox.text

public class PDFReader {
    public static void main(String[] args) {
        File file = new File("C:/my.pdf");
        try {
            // PDFBox 2.0.8 variant:
            // RandomAccessFile randomAccessFile = new RandomAccessFile(file, "r");
            // PDFParser parser = new PDFParser(randomAccessFile);

            PDFParser parser = new PDFParser(new FileInputStream(file));
            parser.parse();
            COSDocument cosDoc = parser.getDocument();
            PDDocument pdDoc = new PDDocument(cosDoc);
            PDFTextStripper pdfStripper = new PDFTextStripper();
            // Extract pages 1 through 5 (1-based, inclusive)
            pdfStripper.setStartPage(1);
            pdfStripper.setEndPage(5);
            String parsedText = pdfStripper.getText(pdDoc);
            System.out.println(parsedText);
            pdDoc.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

2014-05-22 18:53:11 Emad

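It works fine when we read the PDF file from a computer, but when I try to read it from the SD card on Android it gives an error like "java.lang.ClassNotFoundException: Didn't find class "java.awt.print.Printable" on path: DexPathList[[zip file "/data/app/com.geeklabs.pdfreader-1/base.apk"],nativeLibraryDirectories=[/vendor/lib,/system/lib]]" –

I also get "java.lang.NoClassDefFoundError: org.pdfbox.pdmodel.PDDocument" even after adding the library to the build path –

How is PDFBox used? I'm new to this and don't know where to start. I downloaded the jar file, but double-clicking it does nothing. – oivemaria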

Using PDFBox 2.0.7, this is how I get the text of a PDF:

static String getText(File pdfFile) throws IOException {
    // try-with-resources closes the document even if extraction fails
    try (PDDocument doc = PDDocument.load(pdfFile)) {
        return new PDFTextStripper().getText(doc);
    }
}

Call it like this:

try { 
    String text = getText(new File("/home/me/test.pdf")); 
    System.out.println("Text in PDF: " + text); 
} catch (IOException e) { 
    e.printStackTrace(); 
}

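Since user oivemaria asked in the comments:

With Gradle handling dependency management, you can use PDFBox in your application by adding it to the dependencies in build.gradle:

dependencies {
    // On Gradle 7 and later, use 'implementation' instead of 'compile'
    compile group: 'org.apache.pdfbox', name: 'pdfbox', version: '2.0.7'
}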

If you want to preserve the PDF's layout in the parsed text, try PDFLayoutTextStripper.

2016-08-06 17:13:53

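This is better than the accepted answer. I used the same approach, loading the file from the 'src\resources' folder by getting the resource as an InputStream. You can also use the Maven dependency from the m2 repo: https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox – Lucky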

PDFBox 2.0.3 also has a command-line tool.

Download the jar file, then run:
java -jar pdfbox-app-2.0.3.jar ExtractText [OPTIONS] <inputfile> [output-text-file]

Options:
    -password <password>        : Password to decrypt document
    -encoding <output encoding> : UTF-8 (default) or ISO-8859-1, UTF-16BE, UTF-16LE, etc.
    -console                    : Send text to console instead of file
    -html                       : Output in HTML format instead of raw text
    -sort                       : Sort the text before writing
    -ignoreBeads                : Disables the separation by beads
    -debug                      : Enables debug output about the time consumption of every stage
    -startPage <number>         : The first page to start extraction (1-based)
    -endPage <number>           : The last page to extract (inclusive)
    <inputfile>                 : The PDF document to use
    [output-text-file]          : The file to write the text to
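
If you would rather drive the tool from Java than from a shell, something along these lines might work. This is only a sketch: the class ExtractTextCommand and its build method are hypothetical names, the file names are placeholders, and the only flags used are -startPage and -endPage from the option list above. It just assembles the argument list; the ProcessBuilder call that would actually launch it is left commented out because it needs the jar and the PDF to exist.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of PDFBox.
final class ExtractTextCommand {

    /** Builds the ExtractText command line; pages are 1-based and inclusive. */
    static List<String> build(String appJar, String inputPdf, String outputTxt,
                              int startPage, int endPage) {
        if (startPage < 1 || endPage < startPage) {
            throw new IllegalArgumentException("invalid page range: " + startPage + "-" + endPage);
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(appJar);
        cmd.add("ExtractText");
        cmd.add("-startPage");
        cmd.add(String.valueOf(startPage));
        cmd.add("-endPage");
        cmd.add(String.valueOf(endPage));
        cmd.add(inputPdf);
        cmd.add(outputTxt);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("pdfbox-app-2.0.3.jar", "input.pdf", "output.txt", 1, 5);
        // prints: java -jar pdfbox-app-2.0.3.jar ExtractText -startPage 1 -endPage 5 input.pdf output.txt
        System.out.println(String.join(" ", cmd));
        // To run it for real (assumes the jar and input.pdf are in the working directory):
        // int exitCode = new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```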

2016-11-27 14:31:23

This works fine for extracting data from a PDF file that has text content, using PDFBox 2.0.6:

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFTextExtractor {
    public static void main(String[] args) {
        // Arguments: file path, page number (1-based), start text, end text
        System.out.println(readParaFromPDF("C:\\sample1.pdf", 3, "Enter Start Text Here", "Enter Ending Text Here"));
    }

    public static String readParaFromPDF(String pdfPath, int pageNo, String startIdentifier, String endIdentifier) {
        // try-with-resources closes the document when done
        try (PDDocument document = PDDocument.load(new File(pdfPath))) {
            if (document.isEncrypted()) {
                return "";
            }
            PDFTextStripper stripper = new PDFTextStripper();
            // Restrict extraction to the single requested page
            stripper.setStartPage(pageNo);
            stripper.setEndPage(pageNo);
            String pageText = stripper.getText(document);
            int startIndex = pageText.indexOf(startIdentifier);
            int endIndex = pageText.indexOf(endIdentifier);
            // indexOf returns -1 for a missing marker; also reject an end marker before the start
            if (startIndex < 0 || endIndex < startIndex) {
                return "No ParaGraph Found";
            }
            return pageText.substring(startIndex, endIndex) + endIdentifier;
        } catch (IOException e) {
            return "No ParaGraph Found";
        }
    }
}
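
The marker-based slicing in this answer can be tried without a PDF at all. Below is a rough sketch of just the string step, with hypothetical names (TextBetweenMarkers, extract) and an invented sample string. One difference from the answer, stated up front: this version looks for the end marker only after the start marker and returns an empty Optional instead of a sentinel string when either marker is missing.

```java
import java.util.Optional;

// Hypothetical helper for illustration; operates on text already extracted from a page.
final class TextBetweenMarkers {

    /**
     * Returns the text from the first startMarker through the first endMarker
     * that follows it (both markers included), or empty if either is missing.
     */
    static Optional<String> extract(String text, String startMarker, String endMarker) {
        int start = text.indexOf(startMarker);
        if (start < 0) {
            return Optional.empty();
        }
        // Search for the end marker only after the start marker
        int end = text.indexOf(endMarker, start + startMarker.length());
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(text.substring(start, end + endMarker.length()));
    }

    public static void main(String[] args) {
        // Invented sample text standing in for PDFTextStripper output
        String page = "Header\nSummary: PDFBox extracts text.\nEnd of summary\nFooter";
        System.out.println(extract(page, "Summary:", "End of summary").orElse("No paragraph found"));
    }
}
```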

2017-09-14 05:46:32
