2016-11-28 53 views
0

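Converting several files with pdfminer

I found some code online that uses the pdfminer module in Python to convert PDF files to text files. I tried to extend it to handle several PDF files that I have saved in a directory, but the code raises an error.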

My code so far:

import nltk 
import re 
import glob 

from cStringIO import StringIO 
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter 
from pdfminer.converter import TextConverter 
from pdfminer.layout import LAParams 
from pdfminer.pdfpage import PDFPage 

def convert(fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    infile = file(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close()

    with open('D:\Reports\*.txt', 'w') as pdf_file:
        pdf_file.write(text)

    return text

directory = glob.glob('D:\Reports\*.pdf') 

for myfiles in directory: 
    convert(myfiles) 

Error message:

Traceback (most recent call last): 
    File "F:/Text mining/pdfminer for several files", line 40, in <module> 
    convert(myfiles) 
    File "F:/Text mining/pdfminer for several files", line 32, in convert 
    with open('D:\Reports\*.txt', 'w') as pdf_file: 
IOError: [Errno 22] invalid mode ('w') or filename: 'D:\\Reports\\*.txt' 
+0

Please post the error message you are seeing (if any) – Erik

+0

@Erik Thanks for your comment. I have extended my question – In777

+0

I would try replacing '*.txt' in your write step with string interpolation (with open('D:\Reports\{}.txt'.format(fname), 'w')) or, if that fails, changing 'w' to 'wb'. – Erik

Answers

1

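The error comes from trying to write the contents of the text variable to a file named 'D:\Reports\*.txt'. The wildcard character * is not allowed in Windows file names (ref).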
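To illustrate the point above: glob.glob expands a wildcard pattern into a list of existing paths, whereas open() treats its argument as one literal file name and never expands it. The following is a minimal sketch, assuming Python 3 and a temporary directory with made-up file names (a.pdf, b.pdf, notes.txt) instead of the asker's D:\Reports folder; list_pdfs is a hypothetical helper, not part of pdfminer.

```python
import glob
import os
import tempfile


def list_pdfs(directory):
    """Return the sorted paths of all .pdf files directly inside `directory`."""
    # glob.glob is the function that understands '*'; open() does not.
    pattern = os.path.join(directory, '*.pdf')
    return sorted(glob.glob(pattern))


# Demo with throwaway files (hypothetical names, created only for this example).
with tempfile.TemporaryDirectory() as tmp:
    for name in ('a.pdf', 'b.pdf', 'notes.txt'):
        open(os.path.join(tmp, name), 'w').close()

    # Each match is a concrete path that can be passed to open() or convert().
    for pdf_path in list_pdfs(tmp):
        print(os.path.basename(pdf_path))
```

Because every element returned by glob.glob is a real path, the output file name has to be derived from each of those paths individually rather than from the pattern.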

If you want to save each result to a text file with the same name as the PDF, you can replace the writing part of your function with:

outfile = os.path.splitext(os.path.abspath(fname))[0] + '.txt'
with open(outfile, 'wb') as pdf_file:
    pdf_file.write(text)

Don't forget to import os if you want to handle paths in an OS-independent way.
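Putting the pieces of this answer together, the per-file naming and the loop over the directory might look roughly like the sketch below. It assumes Python 3 (the question's code is Python 2), and the names txt_path_for, convert_all and extract_text are hypothetical: extract_text stands in for the pdfminer part of convert() from the question with the writing step removed, and is passed in as a parameter so that the sketch does not depend on pdfminer itself.

```python
import glob
import os


def txt_path_for(pdf_path):
    """Return the path of a .txt file next to `pdf_path`, with the same base name."""
    return os.path.splitext(os.path.abspath(pdf_path))[0] + '.txt'


def convert_all(pattern, extract_text):
    """Write one .txt file for every PDF matching the glob `pattern`.

    `extract_text` is an assumed callable taking a PDF path and returning
    its text as a str (for example a pdfminer-based function).
    Returns the list of .txt paths that were written.
    """
    written = []
    for pdf_path in glob.glob(pattern):
        out_path = txt_path_for(pdf_path)
        # Text mode with an explicit encoding, since extract_text returns str.
        with open(out_path, 'w', encoding='utf-8') as out_file:
            out_file.write(extract_text(pdf_path))
        written.append(out_path)
    return written
```

With a working extraction function, a call such as convert_all(r'D:\Reports\*.pdf', my_extract) would then write D:\Reports\name.txt for every D:\Reports\name.pdf; my_extract is again a placeholder name.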

+0

Thanks! It works perfectly! – In777

0

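Perhaps you should change:

with open('D:\Reports\*.txt', 'w') as pdf_file:
    pdf_file.write(text)

to:

# fname is the path of the source PDF, so append '.txt' instead of overwriting it
with open(fname + '.txt', 'w') as pdf_file:
    pdf_file.write(text)

However, I don't have Python 2.7-3.4 on my machine to verify this.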
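On the 'w' versus 'wb' choice that comes up in the comments and answers: in Python 2 the extracted text is a byte string, so both modes accept it, but in Python 3 a file opened with 'wb' only accepts bytes and a str has to be encoded first. A minimal sketch, assuming Python 3 and UTF-8 output; save_text and save_bytes are hypothetical helper names.

```python
def save_text(path, text):
    """Text mode: pass a str and let open() encode it as UTF-8."""
    with open(path, 'w', encoding='utf-8') as out_file:
        out_file.write(text)


def save_bytes(path, text):
    """Binary mode: the str must be encoded to bytes before writing."""
    with open(path, 'wb') as out_file:
        out_file.write(text.encode('utf-8'))
```

Either helper can be used for the writing step; the practical difference is that text mode also translates line endings on Windows, while binary mode writes the bytes unchanged.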