2017-07-28 154 views

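PDFMiner version differences? Getting AttributeError: 'PDFDocument' object has no attribute 'seek'

I took some Python code from an earlier SO question, but it was written for an older version of PDFMiner (and PDFMiner appears to have had some breaking changes since then). I have made some changes to get past those errors, but now I get the following error: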

C:\Users\xxxx\Documents\Programming\Python>pdfextractor.py 
Traceback (most recent call last): 
    File "C:\Users\xxxx\Documents\Programming\Python\pdfextractor.py", line 71, in <module> 
    pdf_to_csv(sourcefile) 
    File "C:\Users\xxxx\Documents\Programming\Python\pdfextractor.py", line 55, in pdf_to_csv 
    for i, page in PDFPage.get_pages(doc): 
    File "C:\Program Files\Python27\lib\site-packages\pdfminer\pdfpage.py", line 119, in get_pages 
    parser = PDFParser(fp) 
    File "C:\Program Files\Python27\lib\site-packages\pdfminer\pdfparser.py", line 43, in __init__ 
    PSStackParser.__init__(self, fp) 
    File "C:\Program Files\Python27\lib\site-packages\pdfminer\psparser.py", line 495, in __init__ 
    PSBaseParser.__init__(self, fp) 
    File "C:\Program Files\Python27\lib\site-packages\pdfminer\psparser.py", line 166, in __init__ 
    self.seek(0) 
    File "C:\Program Files\Python27\lib\site-packages\pdfminer\psparser.py", line 507, in seek 
    PSBaseParser.seek(self, pos) 
    File "C:\Program Files\Python27\lib\site-packages\pdfminer\psparser.py", line 196, in seek 
    self.fp.seek(pos) 
AttributeError: 'PDFDocument' object has no attribute 'seek' 

And here is the code I am running:

# ORIGINAL CODE DOES NOT SEEM COMPATIBLE WITH THE CURRENT VERSION OF PDFMINER! 

# Code taken from: 
# https://stackoverflow.com/questions/25665/python-module-for-converting-pdf-to-text 

def pdf_to_csv(filename): 
    from cStringIO import StringIO 
    from pdfminer.converter import LTChar, TextConverter 
    from pdfminer.layout import LAParams 
    # from pdfminer.pdfparser import PDFDocument, PDFParser  # Not compatible with current version of PDFMiner 
    from pdfminer.pdfparser import PDFParser 
    from pdfminer.pdfdocument import PDFDocument 
    from pdfminer.pdfpage import PDFPage 
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter 

    class CsvConverter(TextConverter): 
     def __init__(self, *args, **kwargs): 
      TextConverter.__init__(self, *args, **kwargs) 

     def end_page(self, i): 
      from collections import defaultdict 
      lines = defaultdict(lambda : {}) 
      for child in self.cur_item._objs:     #<-- changed 
       if isinstance(child, LTChar): 
        (_,_,x,y) = child.bbox 
        line = lines[int(-y)] 
        line[x] = child._text.encode(self.codec) #<-- changed 

      for y in sorted(lines.keys()): 
       line = lines[y] 
       self.outfp.write(";".join(line[x] for x in sorted(line.keys()))) 
       self.outfp.write("\n") 

    # ... the following part of the code is a remix of the 
    # convert() function in the pdfminer/tools/pdf2text module 
    rsrc = PDFResourceManager() 
    outfp = StringIO() 
    device = CsvConverter(rsrc, outfp, codec="utf-8", laparams=LAParams()) 
    # because my test documents are utf-8 (note: utf-8 is the default codec) 

    # doc = PDFDocument()        # Raises error with current version of PDFMiner 
                 # --> TypeError: __init__() takes at least 2 arguments (1 given) 
    fp = open(filename, 'rb') 
    parser = PDFParser(fp) 
    doc = PDFDocument(parser,'')      # Inserted ahead of 'parser.set_document(doc)' to avoid error 
                 # --> UnboundLocalError: local variable 'doc' referenced before assignment 
    parser.set_document(doc) 
    # doc.set_parser(parser)       # Not compatible with current version of PDFMiner 
    # doc.initialize('')        # Not compatible with current version of PDFMiner 

    interpreter = PDFPageInterpreter(rsrc, device) 

    # for i, page in enumerate(doc.get_pages()):  # Not compatible with current version of PDFMiner 
    for i, page in PDFPage.get_pages(doc): 
     outfp.write("START PAGE %d\n" % i) 
     if page is not None: 
      interpreter.process_page(page) 
     outfp.write("END PAGE %d\n" % i) 
     # data = retstr.getvalue() 

    device.close() 
    fp.close() 

    return outfp.getvalue() 

sourcefile = 'testfile1.pdf' 
# sourcefile = 'testfile2.pdf' 
# sourcefile = 'testfile3.pdf' 

pdf_to_csv(sourcefile) 
print 'Done.' 

Can anyone see what is going on here? Do I need to change how I call the parser (arguments, order of calls, etc.)?

I am running Python 2.7.12 and PDFMiner '20140328' on Windows 10.

Answers


Try using

for i, page in enumerate(PDFPage.create_pages(doc)): 

in place of the line

for i, page in PDFPage.get_pages(doc): 

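The code example in the "Basic usage" section of this page of the PDFMiner documentation suggests using create_pages to iterate over the pages in a document. Since you record the index of the page in the variable i, I have wrapped the call to create_pages in enumerate.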
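
To illustrate why the original loop fails and why the replacement works, here is a minimal sketch that uses stand-in classes rather than PDFMiner itself. The names FakeParser, FakeDocument, and the module-level get_pages and create_pages functions are hypothetical; they only mimic the call chain visible in the traceback (get_pages builds a parser from its argument, and the parser calls seek on it), so treat this as an approximation of the behaviour, not PDFMiner's actual implementation.

```python
import io


class FakeParser(object):
    """Stand-in parser: like the parser in the traceback, it seeks on its input."""

    def __init__(self, fp):
        self.fp = fp
        self.fp.seek(0)  # raises AttributeError if fp is not file-like


class FakeDocument(object):
    """Stand-in document that simply holds a list of pages."""

    def __init__(self, parser, pages=("page A", "page B")):
        self.parser = parser
        self.pages = list(pages)


def create_pages(doc):
    """Yield one page per iteration from an already-built document."""
    for page in doc.pages:
        yield page


def get_pages(fp):
    """Build parser and document from a file object, then yield the pages."""
    doc = FakeDocument(FakeParser(fp))
    for page in create_pages(doc):
        yield page


fp = io.BytesIO(b"stand-in for PDF bytes")
doc = FakeDocument(FakeParser(fp))

# Passing the document where a file object is expected fails on seek,
# which has the same shape as the error in the question.
try:
    list(get_pages(doc))
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# create_pages yields bare pages, so enumerate() supplies the index
# that the loop unpacks into i.
indexed = list(enumerate(create_pages(doc)))

print(error_message)
print(indexed)
```

In this sketch, get_pages fails because it is handed a document instead of an open file, while create_pages accepts the document directly. Because create_pages yields single pages rather than (index, page) pairs, the enumerate wrapper is what makes the `for i, page in ...` unpacking valid.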


That is one of the errors I had to correct: the enumerate() statement raised "AttributeError: 'PDFDocument' object has no attribute 'get_pages'".


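@Big_Al_Tx: I don't understand your comment. If you mean that I have merely repeated one of the corrections you already made yourself (which is how it reads to me; sorry if I have misunderstood), that is not the case. The error you mention has nothing to do with the replacement I suggested, because my replacement uses the 'create_pages' method, not 'get_pages'. In addition, I ran your code with this modification, and it ran to completion without the error in your question.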


Sorry, I misunderstood the change you recommended. Yes, it does indeed complete without error. Thank you.