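Using PDFMiner as a library: "AttributeError: 'NoneType' object has no attribute 'getobj'"

I'm writing a script that uploads PDF files and parses them along the way. For the parsing I use PDFMiner.

To turn the opened file into a PDFMiner document I use the function below, which closely follows the instructions at the link above: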

def load_document(self, _file=None):
    """turn the file into a PDFMiner document"""
    if _file is None:
        _file = self.options['file']

    parser = PDFParser(_file)
    doc = PDFDocument()
    doc.set_parser(parser)
    if self.options['password']:
        password = self.options['password']
    else:
        password = ""
    doc.initialize(password)
    if not doc.is_extractable:
        raise ValueError("PDF text extraction not allowed")

    return doc

The expected result is of course a nice PDFDocument instance, but instead I get an error:

Traceback (most recent call last):
  File "bzk_pdf.py", line 45, in <module>
    cli.run_cli(BZKPDFScraper)
  File "/home/toon/Projects/amcat/amcat/scripts/tools/cli.py", line 61, in run_cli
    instance = cls(options)
  File "/home/toon/Projects/amcat/amcat/scraping/pdf.py", line 44, in __init__
    self.doc = self.load_document()
  File "/home/toon/Projects/amcat/amcat/scraping/pdf.py", line 56, in load_document
    doc.set_parser(parser)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdfparser.py", line 327, in set_parser
    self.info.append(dict_value(trailer['Info']))
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 132, in dict_value
    x = resolve1(x)
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 60, in resolve1
    x = x.resolve()
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 49, in resolve
    return self.doc.getobj(self.objid)
AttributeError: 'NoneType' object has no attribute 'getobj'

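I have no idea where to look, and I haven't found anyone else with the same problem.

Some additional information that may help:

Here is my test file: http://www.2shared.com/document/kM_wrI3J/testpdf.html
_file is a Django File object, but using a plain file gives the same result
pdfminer version: 'pdfminer-20110515'
Django: 1.4.3 (probably not relevant)
Python 2.7.3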


2013-02-17 ToonAlfrink

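A minor point, but I think you mean Django version 1.4.3. – 2013-02-17 09:26:40

Does anyone have an answer? Or has anyone tried to reproduce the problem? I really need an answer... – ToonAlfrink 2013-02-17 12:12:23

Try opening the file yourself and passing it to the parser, like this: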

with open(_file,'rb') as f: 
    parser = PDFParser(f) 
    # your normal code here

The way you are doing it now, I suspect you are passing the file name as the argument rather than an open file.
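When it is unclear whether the caller hands over a path or an already open file, a small wrapper can accept both. This is only a sketch: `binary_stream` is a hypothetical helper made up for illustration, not part of PDFMiner, and it assumes that anything which is not a string is already a readable binary file object.

```python
from contextlib import contextmanager


@contextmanager
def binary_stream(source):
    """Yield a binary file object for a path or an already open file.

    Hypothetical helper: a str is treated as a path and opened in 'rb'
    mode; anything else is assumed to be a file-like object and is
    passed through untouched (and left open for the caller to close).
    """
    if isinstance(source, str):
        with open(source, "rb") as handle:
            yield handle
    else:
        yield source
```

Inside `with binary_stream(_file) as f:` the parser would then be built from `f`, whichever kind of argument was given.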


2013-02-17 09:28:31

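I'm sorry, but that is not the case. As I said at the end of the question, '_file' is a Django File object, and using a plain file has the same effect. – ToonAlfrink 2013-02-17 09:41:29

After some experimenting I found that I was missing one line:

parser.set_document(doc)

With that line added, the function now works.

Having to link the two objects in both directions looks like poor library design to me, but it may be that I'm missing something and this merely works around the error.

Either way, I now have a PDF document with the data I need.

Here is the final result: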

def load_document(self, _file=None):
    """turn the file into a PDFMiner document"""
    if _file is None:
        _file = self.options['file']

    parser = PDFParser(_file)
    doc = PDFDocument()
    # link parser and document in both directions;
    # without set_document(), set_parser() raised the AttributeError
    parser.set_document(doc)
    doc.set_parser(parser)

    if 'password' in self.options.keys():
        password = self.options['password']
    else:
        password = ""

    doc.initialize(password)

    if not doc.is_extractable:
        raise ValueError("PDF text extraction not allowed")

    return doc
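Why the extra line matters can be illustrated with a toy model. This is a sketch of what the traceback suggests, not PDFMiner's actual code: every `Toy*` class and the `load` function are made up for illustration. The assumption is that the parser stamps its current document onto each object reference it creates, so a reference created before `set_document()` is called carries `None` and cannot be resolved.

```python
class ToyObjRef:
    """Stand-in for an indirect reference such as the trailer's 'Info' entry."""

    def __init__(self, doc, objid):
        self.doc = doc      # captured once, when the parser creates the reference
        self.objid = objid

    def resolve(self):
        # Same shape as the last frame of the traceback.
        return self.doc.getobj(self.objid)


class ToyParser:
    def __init__(self):
        self.doc = None     # stays None until set_document() is called

    def set_document(self, doc):
        self.doc = doc

    def read_trailer(self):
        # The reference is bound to whatever self.doc is at parse time.
        return {"Info": ToyObjRef(self.doc, 1)}


class ToyDocument:
    def __init__(self):
        self.objects = {1: {"Producer": "example"}}
        self.info = []

    def getobj(self, objid):
        return self.objects[objid]

    def set_parser(self, parser):
        trailer = parser.read_trailer()
        self.info.append(trailer["Info"].resolve())


def load(link_parser_to_doc):
    """Build a toy document, optionally skipping parser.set_document()."""
    parser = ToyParser()
    doc = ToyDocument()
    if link_parser_to_doc:
        parser.set_document(doc)
    doc.set_parser(parser)
    return doc


if __name__ == "__main__":
    print(load(link_parser_to_doc=True).info)
    try:
        load(link_parser_to_doc=False)
    except AttributeError as exc:
        print("without set_document():", exc)
```

In this model the order of the two calls matters in the same way as in the fixed function: `set_document()` has to run before `set_parser()` reads the trailer.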


2013-02-17 12:33:26 ToonAlfrink
