2014-02-11

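Get the font of each line using PDFBox

Is there a way to get the font of each line of a PDF file using PDFBox? I have tried the code below, but it only lists all the fonts used on the page. It does not show which lines or which text are rendered in each font.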

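// PDFBox 1.8.x API; doc is an already loaded PDDocument.
List<PDPage> pages = doc.getDocumentCatalog().getAllPages();
for (PDPage page : pages)
{
    // getFonts() maps each font resource name on the page to its PDFont.
    Map<String, PDFont> pageFonts = page.getResources().getFonts();
    for (Map.Entry<String, PDFont> entry : pageFonts.entrySet())
    {
        System.out.println(entry.getKey() + " - " + entry.getValue());
        System.out.println(entry.getValue().getBaseFont());
    }
}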

Any input is appreciated. Thanks!

Answer

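Whenever you want to extract text from a PDF using PDFBox, whether plain or with styling information, you should generally start with the PDFTextStripper class or one of its relatives. This class already does all the heavy lifting involved in parsing PDF content.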

You use the plain PDFTextStripper class like this:

PDDocument document = ...; 
PDFTextStripper stripper = new PDFTextStripper(); 
// set stripper start and end page or bookmark attributes unless you want all the text 
String text = stripper.getText(document); 

This returns only the plain text, e.g. for some R40 form:

Claim for repayment of tax deducted 
from savings and investments 
How to fill in this form 
Please fill in this form with details of your income for the 
above tax year. The enclosed Notes will help you (but there is 
not a note for every box on the form). If you need more help 
with anything on this form, please phone us on the number 
shown above. 
If you are not a UK resident, do not use this form – please 
contact us. 
Please do not send us any personal records, or tax 
certificates or vouchers with your form. We will contact 
you if we need these. 
Please allow four weeks before contacting us about your 
repayment. We will pay you as quickly as possible. 
Use black ink and capital letters 
Cross out any mistakes and write the 
correct information below 
... 

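You can, on the other hand, override its method writeString(String, List<TextPosition>) and process more information than the mere text. To insert the name of the font wherever the font changes, you can use this: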

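// PDFBox 1.8.x API (2.x renamed these calls to getName() and getUnicode()).
// Needs java.io.IOException, java.util.List and, in 1.8.x,
// org.apache.pdfbox.util.PDFTextStripper and org.apache.pdfbox.util.TextPosition.
PDFTextStripper stripper = new PDFTextStripper() {
    // Base font name of the most recently written character
    String prevBaseFont = "";

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        StringBuilder builder = new StringBuilder();

        for (TextPosition position : textPositions)
        {
            String baseFont = position.getFont().getBaseFont();
            // Emit a [FontName] marker only where the font changes
            if (baseFont != null && !baseFont.equals(prevBaseFont))
            {
                builder.append('[').append(baseFont).append(']');
                prevBaseFont = baseFont;
            }
            builder.append(position.getCharacter());
        }

        // Hand the annotated text to the stripper's normal output
        writeString(builder.toString());
    }
};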

For the same form, you get:

[DHSLTQ+IRModena-Bold]Claim for repayment of tax deducted 
from savings and investments 
How to fill in this form 
[OIALXD+IRModena-Regular]Please fill in this form with details of your income for the 
above tax year. The enclosed Notes will help you (but there is 
not a note for every box on the form). If you need more help 
with anything on this form, please phone us on the number 
shown above. 
If you are not a UK resident, do not use this form – please 
contact us. 
[DHSLTQ+IRModena-Bold]Please do not send us any personal records, or tax 
certificates or vouchers with your form. We will contact 
you if we need these. 
[OIALXD+IRModena-Regular]Please allow four weeks before contacting us about your 
repayment. We will pay you as quickly as possible. 
Use black ink and capital letters 
Cross out any mistakes and write the 
correct information below 
... 

If you don't want the font information merged into the text, simply build separate structures in your method override.
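One possible shape for such a separate structure is a list of "font runs", each holding a font name and the text drawn with it. The sketch below is plain Java with no PDFBox dependency; FontRunCollector, FontRun and add are hypothetical names invented for this illustration and are not part of PDFBox. It assumes, like the override above, that a null font name means "font unchanged".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a PDFBox class): groups text into runs sharing one font. */
class FontRunCollector {

    /** One stretch of text drawn with a single font. */
    static final class FontRun {
        final String fontName;
        final StringBuilder text = new StringBuilder();

        FontRun(String fontName) {
            this.fontName = fontName;
        }

        @Override
        public String toString() {
            return "[" + fontName + "] " + text;
        }
    }

    private final List<FontRun> runs = new ArrayList<>();

    /** Appends text to the current run, starting a new run when the font changes. */
    void add(String fontName, String text) {
        FontRun last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
        // A null font name is treated as "same font as before" (an assumption).
        boolean fontChanged = fontName != null
                && (last == null || !fontName.equals(last.fontName));
        if (last == null || fontChanged) {
            last = new FontRun(fontName == null ? "(unknown)" : fontName);
            runs.add(last);
        }
        last.text.append(text);
    }

    /** Returns the collected runs in reading order. */
    List<FontRun> runs() {
        return runs;
    }
}
```

Inside the writeString override you would then call something like collector.add(position.getFont().getBaseFont(), position.getCharacter()) for each TextPosition, and still pass the unmodified text on to writeString(text) so that the stripper's normal output stays free of font markers.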

TextPosition provides much more information about the text it represents, for example its position and font size. Check it out!