PDF text extraction with Python 3.4

The text in the PDF file is real text, not a scanned image. PDFMiner does not support Python 3. Are there any other solutions?

2015-06-24 Tom Liu

What about PyPDF2 (https://github.com/mstamy2/PyPDF2)?

There is a 3k version of the PDFMiner library: https://pypi.python.org/pypi/pdfminer3k

There is also the pdfminer2 fork, which supports Python 3.4 and can be installed with pip3: https://github.com/metachris/pdfminer

This thread helped me piece something together.

from urllib.request import urlopen 
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter 
from pdfminer.converter import TextConverter 
from pdfminer.layout import LAParams 
from pdfminer.pdfpage import PDFPage 
from io import StringIO, BytesIO 

def readPDF(pdfFile):
    """Extract all text from a PDF given as a binary file-like object."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()  # collects the extracted text
    codec = 'utf-8'
    laparams = LAParams()  # default layout-analysis parameters
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages
    for page in PDFPage.get_pages(pdfFile, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)

    device.close()
    textstr = retstr.getvalue()
    retstr.close()
    return textstr

if __name__ == "__main__": 
    #scrape = open("../warandpeace/chapter1.pdf", 'rb') # for local files 
    scrape = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf") # for external files 
    pdfFile = BytesIO(scrape.read()) 
    outputString = readPDF(pdfFile) 
    print(outputString) 
    pdfFile.close()
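
In `readPDF`, `pagenos=set()` together with `maxpages=0` makes `PDFPage.get_pages` walk every page. To extract only some pages, `pagenos` can instead hold zero-based page indices. A minimal sketch, assuming a hypothetical helper `parse_page_spec` (not part of pdfminer) that turns a 1-based spec such as "1,3-5" into that set:

```python
def parse_page_spec(spec):
    """Convert a 1-based page spec like "1,3-5" into a set of zero-based indices.

    Hypothetical helper for illustration; it is not part of pdfminer.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,3"
        if "-" in part:
            first, last = (int(n) for n in part.split("-", 1))
        else:
            first = last = int(part)
        if first < 1 or last < first:
            raise ValueError("invalid page range: %r" % part)
        # PDFPage.get_pages counts pages from 0, so shift by one.
        pages.update(range(first - 1, last))
    return pages


if __name__ == "__main__":
    print(sorted(parse_page_spec("1,3-5")))  # [0, 2, 3, 4]
```

The result would replace the empty set in `readPDF`, for example `pagenos = parse_page_spec("1,3-5")`.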


2016-02-05 13:30:20 DmcG

What parameters should be used to export an HTML file?
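
One possible answer to the HTML question: `pdfminer.converter` also has an `HTMLConverter`, which can stand in for `TextConverter`. A rough sketch, assuming the installed pdfminer fork ships `HTMLConverter` with the same leading arguments as `TextConverter`. The function name `pdf_to_html` is made up here. `HTMLConverter` encodes its output when `codec` is set, so the sketch writes into a `BytesIO` rather than a `StringIO`.

```python
from io import BytesIO


def pdf_to_html(pdf_file, codec="utf-8"):
    """Return the PDF in the binary file object `pdf_file` as an HTML string.

    Hypothetical helper; assumes the installed pdfminer fork provides
    HTMLConverter in pdfminer.converter.
    """
    # Imported here so the module still loads when pdfminer is missing.
    from pdfminer.converter import HTMLConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage

    rsrcmgr = PDFResourceManager()
    out = BytesIO()  # HTMLConverter writes encoded bytes when codec is set
    device = HTMLConverter(rsrcmgr, out, codec=codec, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(pdf_file, check_extractable=True):
        interpreter.process_page(page)
    device.close()  # writes the closing HTML footer
    html = out.getvalue().decode(codec)
    out.close()
    return html
```

It would be called like `readPDF`, with a file opened in binary mode: `html = pdf_to_html(open("chapter1.pdf", "rb"))`.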
