我跟了周圍的幾個教程,但我不能得到這個代碼塊的運行,我的確從StringIO的必要切換到BytesIO(我相信嗎?)Pdfminer蟒蛇3.5
我不確定爲什麼「香蕉'沒有印刷任何東西,我認爲這些錯誤可能是紅鯡魚?是不是跟着一個python2.7教程並試圖將它翻譯成python3?
errors: File "/Users/foo/PycharmProjects/Try/Pdfminer.py", line 28, in <module>
banana = convert("A1.pdf")
File "/Users/foo/PycharmProjects/Try/Pdfminer.py", line 19, in convert
infile = file(fname, 'rb')
NameError: name 'file' is not defined
腳本
from io import BytesIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
def convert(fname, pages=None):
if not pages:
pagenums = set()
else:
pagenums = set(pages)
output = BytesIO()
manager = PDFResourceManager()
converter = TextConverter(manager, output, laparams=LAParams())
interpreter = PDFPageInterpreter(manager, converter)
infile = file(fname, 'rb')
for page in PDFPage.get_pages(infile, pagenums):
interpreter.process_page(page)
infile.close()
converter.close()
text = output.getvalue()
output.close
return text
banana = convert("A1.pdf")
print(banana)
同樣的事情發生這種變異:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import BytesIO
def convert_pdf_to_txt(path):
rsrcmgr = PDFResourceManager()
retstr = BytesIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
fp = file(path, 'rb')
interpreter = PDFPageInterpreter(rsrcmgr, device)
password = ""
maxpages = 0
caching = True
pagenos=set()
for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
interpreter.process_page(page)
text = retstr.getvalue()
fp.close()
device.close()
retstr.close()
return text
Banana = convert_pdf_to_txt("A1.pdf")
print(Banana)
我試圖尋找這個(大部分pdfminer代碼是從this或this),但有沒有運氣。
任何洞察力是讚賞。
乾杯
請確認由要麼upvoting或接受我的答案 – animal