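Converting several files with pdfminer
I found code online that converts a PDF file to a text file using the pdfminer module in Python. I tried to extend it so that it processes several PDF files that I have saved in a directory, but the code raises an error.
My code so far: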
import nltk
import re
import glob
from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert(fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    infile = file(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close
    with open('D:\Reports\*.txt', 'w') as pdf_file:
        pdf_file.write(text)
    return text

directory = glob.glob('D:\Reports\*.pdf')
for myfiles in directory:
    convert(myfiles)
Error message:
Traceback (most recent call last):
  File "F:/Text mining/pdfminer for several files", line 40, in <module>
    convert(myfiles)
  File "F:/Text mining/pdfminer for several files", line 32, in convert
    with open('D:\Reports\*.txt', 'w') as pdf_file:
IOError: [Errno 22] invalid mode ('w') or filename: 'D:\\Reports\\*.txt'
Please post the error message you are seeing (if any) – Erik
@Erik Thanks for your comment. I have expanded my question – In777
I would try changing the '*.txt' in your write section to use string interpolation ('with open('D:\Reports\{}.txt'.format(fname), 'w')'), or, if that fails, changing 'w' to 'wb'. – Erik
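A rough sketch of the idea in Erik's comment, for anyone landing here. The wildcard `*` is expanded by `glob` only; `open` takes it literally, and Windows does not allow `*` in a filename, which would explain the IOError. Since `fname` is already the full path returned by `glob.glob`, formatting it into a second path would likely not give a valid name either, so this sketch derives the output name from the input path instead. The names `txt_path_for` and `convert_all` are hypothetical, and `convert` is assumed to be a function that takes a PDF path and returns the extracted text (as in the question, minus the `with open(...)` part).

```python
import glob
import os


def txt_path_for(pdf_path):
    """Return the path of a .txt file that sits next to pdf_path."""
    base, _ext = os.path.splitext(pdf_path)
    return base + '.txt'


def convert_all(pattern, convert):
    """Write one .txt file per PDF matching pattern.

    `convert` is assumed to take a PDF path and return its text.
    Returns the list of .txt paths that were written.
    """
    written = []
    for pdf_path in glob.glob(pattern):
        text = convert(pdf_path)
        out_path = txt_path_for(pdf_path)
        # One output file per input file; no wildcard in the name.
        with open(out_path, 'w') as txt_file:
            txt_file.write(text)
        written.append(out_path)
    return written
```

It would be called as `convert_all(r'D:\Reports\*.pdf', convert)`. The raw string (`r'...'`) keeps the backslashes from being read as escape sequences.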