2014-01-12 79 views

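How can I extract text from a PDF uploaded to Google App Engine using PyPDF2?

Is there a way to extract the text and documentInfo from a PDF file uploaded through Google App Engine? I would like to use PyPDF2, and my code looks like this: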

pdf_file = self.request.POST['file'].file 
pdf_reader = pypdf.PdfFileReader(pdf_file) 

This gives me the following error:

Traceback (most recent call last): 
.... 
    File "/myrepo/myproj/main.py", line 154, in post 
    pdf_text = pypdf.PdfFileReader(pdf_file) 
    File "lib/PyPDF2/pdf.py", line 649, in __init__ 
    self.read(stream) 
    File "lib/PyPDF2/pdf.py", line 1100, in read 
    raise utils.PdfReadError, "EOF marker not found" 
PdfReadError: EOF marker not found 

It raises this error for every file, even for files that PyPDF2 reads successfully when they are opened from disk with open(filename, 'r').

What am I missing? Thanks in advance!

Answer


The solution is to use get_uploads from blobstore_handlers.BlobstoreUploadHandler:

from google.appengine.ext import blobstore
from google.appengine.ext.webapp import blobstore_handlers
from cStringIO import StringIO
import PyPDF2

class UploadHandler(blobstore_handlers.BlobstoreUploadHandler):
    def post(self):
        # 'file' is the name of the upload field in the form
        upload_files = self.get_uploads('file')
        blob_info = upload_files[0]
        # Read the whole blob into an in-memory, seekable buffer
        blob_reader = blobstore.BlobReader(blob_info)
        blob_content = StringIO(blob_reader.read())
        pdf_info = PyPDF2.PdfFileReader(blob_content)
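
Once the reader has been built as above, the text and documentInfo that the question asks about could be pulled out roughly as follows. This is only a minimal sketch and has not been tested on App Engine. The helper names buffer_upload, has_eof_marker and summarize_pdf are hypothetical, and the 1024-byte tail window is an assumption. The getNumPages, getPage, extractText and getDocumentInfo calls are the PyPDF2 1.x method names; newer releases rename them. The has_eof_marker check is just a quick way to see whether the bytes that actually reach PyPDF2 end with the %%EOF marker it looks for.

```python
import io


def buffer_upload(stream):
    """Copy a file-like upload into a seekable in-memory buffer."""
    return io.BytesIO(stream.read())


def has_eof_marker(buf, tail_size=1024):
    """Return True if '%%EOF' appears in the last tail_size bytes.

    tail_size is an assumed window; the buffer is rewound afterwards
    so it can be handed straight to a PDF reader.
    """
    buf.seek(0, io.SEEK_END)
    size = buf.tell()
    buf.seek(max(0, size - tail_size))
    tail = buf.read()
    buf.seek(0)
    return b"%%EOF" in tail


def summarize_pdf(reader):
    """Collect documentInfo and per-page text.

    `reader` is assumed to expose the PyPDF2 1.x PdfFileReader methods
    getNumPages(), getPage(i) and getDocumentInfo().
    """
    pages = []
    for i in range(reader.getNumPages()):
        pages.append(reader.getPage(i).extractText())
    return {"info": reader.getDocumentInfo(), "pages": pages}


# Hypothetical use inside the handler's post() method:
#   blob_content = buffer_upload(blobstore.BlobReader(blob_info))
#   if has_eof_marker(blob_content):
#       summary = summarize_pdf(PyPDF2.PdfFileReader(blob_content))
```

If has_eof_marker returns False for an upload that opens fine from disk, the bytes were probably truncated or altered before they reached the reader, which would point at the upload path and not at PyPDF2.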