2016-03-08 87 views

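Processing individual pages with PDFMiner

I have some PDF documents whose text I cannot extract with PyPDF, only with PDFMiner. The code below works fine for extracting all the text from a PDF: it iterates over the whole document and returns all of the text. Is there a way to process only certain pages of the PDF? My PDFs are all 2000-3000+ pages long, and I only need every other page.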

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter 
from pdfminer.converter import TextConverter 
from pdfminer.layout import LAParams 
from pdfminer.pdfpage import PDFPage 
from cStringIO import StringIO 

def convert_pdf_to_txt(path): 
    rsrcmgr = PDFResourceManager() 
    retstr = StringIO() 
    codec = 'utf-8' 
    laparams = LAParams() 
    device = TextConverter(rsrcmgr, retstr, codec=codec,laparams=laparams) 
    fp = file(path, 'rb') 
    interpreter = PDFPageInterpreter(rsrcmgr, device) 
    password = "" 
    maxpages = 0 
    caching = True 
    pagenos=set() 

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True): 
        interpreter.process_page(page)

    text = retstr.getvalue() 

    fp.close() 
    device.close() 
    retstr.close() 
    return text 

Answer


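Can't you use enumerate to get the page number while iterating over the pages? If you only need every other page, use the modulo operator. If you only want specific pages, test the page number against a range.

Example: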

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter 
from pdfminer.converter import TextConverter 
from pdfminer.layout import LAParams 
from pdfminer.pdfpage import PDFPage 
from cStringIO import StringIO 

def convert_pdf_to_txt(path): 
    rsrcmgr = PDFResourceManager() 
    retstr = StringIO() 
    codec = 'utf-8' 
    laparams = LAParams() 
    device = TextConverter(rsrcmgr, retstr, codec=codec,laparams=laparams) 
    fp = file(path, 'rb') 
    interpreter = PDFPageInterpreter(rsrcmgr, device) 
    password = "" 
    maxpages = 0 
    caching = True 
    pagenos=set() 

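    # enumerate() yields a zero-based page index alongside each page
    for pagenumber, page in enumerate(PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True)):
        print(pagenumber)
        if pagenumber % 2 == 0:
            # Only even-indexed pages are converted to text
            print("even page number")
            interpreter.process_page(page)
        else:
            print("odd page number")
        # Alternative filter: a range test picks out specific pages
        if 5 <= pagenumber <= 10:
            print("pages 5 to 10")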

    text = retstr.getvalue() 

    fp.close() 
    device.close() 
    retstr.close() 
    return text 
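
The same two filters (every other page, or a fixed range) can also be kept out of the loop body. Below is a minimal sketch, not part of PDFMiner: every_nth and page_index_set are hypothetical helper names made up for illustration, and a plain list stands in for the PDFPage objects so the snippet runs without a PDF. The assumption is that the result of every_nth could wrap PDFPage.get_pages(...), and that a set from page_index_set could be passed as pagenos, which in the PDFMiner versions I have seen is treated as zero-based page indices. Check that against your installed version.

```python
from itertools import islice


def every_nth(items, n=2, start=0):
    """Lazily yield every n-th item of any iterable, beginning at index `start`."""
    return islice(items, start, None, n)


def page_index_set(first, last, step=1):
    """Return the zero-based indices first..last (inclusive) as a set."""
    return set(range(first, last + 1, step))


if __name__ == "__main__":
    # Stand-in for the page objects that PDFPage.get_pages() would yield
    pages = ["page%d" % i for i in range(7)]
    print(list(every_nth(pages)))          # every other page, from index 0
    print(sorted(page_index_set(5, 10)))   # candidate value for pagenos
```

Note that slicing the iterator only skips the call to interpreter.process_page; the pages are still walked one by one, so this saves the text conversion work rather than the iteration itself.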

Thanks, this is exactly what I was looking for. – user2665140