Reading PDF properties/metadata in Python

How can I use Python to read the properties/metadata stored in a PDF file, such as the title, author, subject, and keywords?

2013-01-08 Khaleel

from pdfminer.pdfparser import PDFParser 
from pdfminer.pdfdocument import PDFDocument 

fp = open('diveintopython.pdf', 'rb') 
parser = PDFParser(fp) 
doc = PDFDocument(parser) 

print(doc.info)  # The "Info" metadata

Here is the output:

>>> [{'CreationDate': 'D:20040520151901-0500', 
    'Creator': 'DocBook XSL Stylesheets V1.52.2', 
    'Keywords': 'Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free', 
    'Producer': 'htmldoc 1.8.23 Copyright 1997-2002 Easy Software Products, All Rights Reserved.', 
    'Title': 'Dive Into Python'}]
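
The CreationDate value in the output above is a PDF date string (D:YYYYMMDDHHmmSS followed by a UTC offset), not a Python datetime. Below is a minimal sketch of how such a string could be converted, assuming Python 3 and only the standard library. The name parse_pdf_date is hypothetical and not part of pdfminer. The regex also accepts the apostrophe-delimited offset form (-05'00') that some PDF producers write.

```python
import re
from datetime import datetime, timedelta, timezone

# Every field after the year is optional in a PDF date string.
_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<sign>[+\-Z])?(?P<tz_hour>\d{2})?'?(?P<tz_minute>\d{2})?'?$"
)


def parse_pdf_date(value):
    """Convert a PDF date string such as 'D:20040520151901-0500' to a datetime.

    Hypothetical helper for illustration; missing fields fall back to
    January, day 1, and midnight.
    """
    match = _PDF_DATE.match(value.strip())
    if match is None:
        raise ValueError("not a PDF date string: %r" % (value,))
    parts = match.groupdict()

    tzinfo = None  # no offset given: return a naive datetime
    sign = parts["sign"]
    if sign == "Z":
        tzinfo = timezone.utc
    elif sign in ("+", "-"):
        offset = timedelta(
            hours=int(parts["tz_hour"] or 0),
            minutes=int(parts["tz_minute"] or 0),
        )
        tzinfo = timezone(-offset if sign == "-" else offset)

    return datetime(
        int(parts["year"]),
        int(parts["month"] or 1),
        int(parts["day"] or 1),
        int(parts["hour"] or 0),
        int(parts["minute"] or 0),
        int(parts["second"] or 0),
        tzinfo=tzinfo,
    )


created = parse_pdf_date("D:20040520151901-0500")
print(created.isoformat())
```

Returning a timezone-aware datetime when an offset is present keeps the original local time and offset together, so values from different documents can be compared directly.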

For more information, see this tutorial: A lightweight XMP parser for extracting PDF metadata in Python.

Source

2013-01-08 06:22:11 namit

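Heads up: the author of pdfminer says it is not compatible with Python 3, at least as of the date of this post ([link](https://github.com/euske/pdfminer/)) – JSmyth

As of November 2013, "the PDFDocument class now takes a PDFParser object as an argument. PDFDocument.set_parser() and PDFParser.set_document() are removed." So you can do doc = PDFDocument(parser) and skip the calls to set_document, set_parser, and initialize. –

@JSmyth The [PyPI index](https://pypi.python.org/pypi?%3Aaction=search&term=pdfminer&submit=search) currently lists three working 'pdfminer' forks that are compatible with Python 3: 'pip search pdfminer' – zero2cx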

I have implemented this using pyPdf. See the sample code below.

from pyPdf import PdfFileReader 
pdf_toread = PdfFileReader(open("doc2.pdf", "rb")) 
pdf_info = pdf_toread.getDocumentInfo() 
print str(pdf_info)

Output:

{'/Title': u'Microsoft Word - Agnico-Eagle - Complaint (00040197-2)', '/CreationDate': u"D:20111108111228-05'00'", '/Producer': u'Acrobat Distiller 10.0.0 (Windows)', '/ModDate': u"D:20111108112409-05'00'", '/Creator': u'PScript5.dll Version 5.2.2', '/Author': u'LdelPino'}
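
The keys in the output above carry a leading slash because they are PDF name objects. Here is a small sketch of how the result could be flattened into a plain dictionary with ordinary key names, assuming the info object behaves like a dict, as the printed output suggests. The function normalize_info is a hypothetical helper, not part of pyPdf.

```python
def normalize_info(info):
    """Return a plain dict with the leading '/' removed from each key.

    Hypothetical helper for illustration; values are converted to str,
    and missing values are kept as None.
    """
    cleaned = {}
    for key, value in dict(info).items():
        name = key[1:] if key.startswith("/") else key
        cleaned[name] = None if value is None else str(value)
    return cleaned


# Sample data copied from the output above (subset of the fields).
sample = {
    "/Title": "Microsoft Word - Agnico-Eagle - Complaint (00040197-2)",
    "/Producer": "Acrobat Distiller 10.0.0 (Windows)",
    "/Author": "LdelPino",
}
metadata = normalize_info(sample)
print(metadata.get("Author"))
print(metadata.get("Subject"))  # missing fields come back as None via dict.get
```

Using dict.get on the result avoids a KeyError for fields such as Subject or Keywords, which many PDFs leave unset.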

Note: the pyPdf homepage says it is no longer maintained.

Source

2013-01-08 08:49:01 Khaleel

Don't use 'file'; use 'open' instead. –

Note that pyPdf is marked as no longer supported on its homepage. –

For Python 3, see PyPDF2, with the sample code from @Khaleel updated to:

from PyPDF2 import PdfFileReader 
pdf_toread = PdfFileReader(open("test.pdf", "rb")) 
pdf_info = pdf_toread.getDocumentInfo() 
print(str(pdf_info))

Install it with pip install PyPDF2.

Source

2016-10-08 11:31:14

For Python 3 and the new pdfminer (pip install pdfminer3k):

Source

2016-12-19 01:36:11 Rabash
