PDF text extraction with Python 3.4
The text in the PDF file is real text, not a scanned image. PDFMiner does not support Python 3; is there another solution?
There is also the pdfminer2 fork, which supports Python 3.4 and can be installed with pip3. https://github.com/metachris/pdfminer
This thread helped me piece the following together.
from urllib.request import urlopen
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO, BytesIO

def readPDF(pdfFile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(pdfFile, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)

    device.close()
    textstr = retstr.getvalue()
    retstr.close()
    return textstr

if __name__ == "__main__":
    #scrape = open("../warandpeace/chapter1.pdf", 'rb') # for local files
    scrape = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf") # for external files
    pdfFile = BytesIO(scrape.read())
    outputString = readPDF(pdfFile)
    print(outputString)
    pdfFile.close()
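The script above switches between a local file and a URL by commenting lines in and out. A small helper could make that choice at run time instead. This is only a sketch using the standard library; the name load_pdf_bytes is hypothetical and not part of pdfminer.

```python
from io import BytesIO
from urllib.request import urlopen


def load_pdf_bytes(source):
    """Return the PDF at `source` (URL or local path) as an in-memory BytesIO.

    `load_pdf_bytes` is a hypothetical helper name, not a pdfminer function.
    """
    if source.startswith(("http://", "https://")):
        # Remote file: download the whole body into memory.
        with urlopen(source) as response:
            data = response.read()
    else:
        # Local file: PDFs must be opened in binary mode.
        with open(source, "rb") as handle:
            data = handle.read()
    return BytesIO(data)
```

The returned object can be passed straight to readPDF, for example readPDF(load_pdf_bytes("../warandpeace/chapter1.pdf")).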
What parameters should be used to export an HTML file?
https://github.com/mstamy2/PyPDF2?
There is a Python 3 (3k) version of the PDFMiner library: https://pypi.python.org/pypi/pdfminer3k
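On the HTML export question: one possible approach is to swap TextConverter for HTMLConverter, which also lives in pdfminer.converter. The sketch below assumes a pdfminer.six-style API (it may differ in pdfminer2 or pdfminer3k), and pdf_to_html is a hypothetical name. The converter writes encoded bytes when a codec is given, so the output buffer is a BytesIO rather than a StringIO.

```python
from io import BytesIO


def pdf_to_html(pdf_file, password=""):
    """Render a binary PDF file object to an HTML string (sketch only).

    Assumes a pdfminer.six-style API; `pdf_to_html` is a hypothetical name.
    """
    # Imported here so this module still loads where pdfminer is not installed.
    from pdfminer.converter import HTMLConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    output = BytesIO()  # receives UTF-8 encoded HTML
    device = HTMLConverter(rsrcmgr, output, codec="utf-8", laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    for page in PDFPage.get_pages(pdf_file, password=password):
        interpreter.process_page(page)

    device.close()  # lets the converter finish writing the document
    html = output.getvalue().decode("utf-8")
    output.close()
    return html
```

Usage mirrors readPDF: open the PDF in binary mode (or wrap downloaded bytes in a BytesIO) and pass the file object in.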