2009-12-16 84 views
5
分割pdf

我想使用pyPdf根據大綱中的每個目標位置指向pdf中的不同頁面的大綱來分割pdf文件。根據大綱

例如輪廓:

 
main  --> points to page 1 
    sect1 --> points to page 1 
    sect2 --> points to page 15 
    sect3 --> points to page 22 

是pyPdf可輕鬆遍歷文件或文檔的輪廓每個目的地的每個頁面;然而,我無法弄清楚如何獲取目的地指向的頁碼。

是否有人知道如何找到大綱中每個目的地的引用頁碼?

回答

6

我理解了它:

 
    class Darrell(pyPdf.PdfFileReader): 

     def getDestinationPageNumbers(self): 
      def _setup_outline_page_ids(outline, _result=None): 
       if _result is None: 
        _result = {} 
       for obj in outline: 
        if isinstance(obj, pyPdf.pdf.Destination): 
         _result[(id(obj), obj.title)] = obj.page.idnum 
        elif isinstance(obj, list): 
         _setup_outline_page_ids(obj, _result) 
       return _result 

      def _setup_page_id_to_num(pages=None, _result=None, _num_pages=None): 
       if _result is None: 
        _result = {} 
       if pages is None: 
        _num_pages = [] 
        pages = self.trailer["/Root"].getObject()["/Pages"].getObject() 
       t = pages["/Type"] 
       if t == "/Pages": 
        for page in pages["/Kids"]: 
         _result[page.idnum] = len(_num_pages) 
         _setup_page_id_to_num(page.getObject(), _result, _num_pages) 
       elif t == "/Page": 
        _num_pages.append(1) 
       return _result 

      outline_page_ids = _setup_outline_page_ids(self.getOutlines()) 
      page_id_to_page_numbers = _setup_page_id_to_num() 

      result = {} 
      for (_, title), page_idnum in outline_page_ids.iteritems(): 
       result[title] = page_id_to_page_numbers.get(page_idnum, '???') 
      return result 

    pdf = Darrell(open(PATH-TO-PDF, 'rb')) 
    template = '%-5s %s' 
    print template % ('page', 'title') 
    for p,t in sorted([(v,k) for k,v in pdf.getDestinationPageNumbers().iteritems()]): 
     print template % (p+1,t) 
1

達雷爾的類可以稍微進行修改,以產生內容的多級表爲一個pdf

(在PDFTK工具箱pdftoc的方式。)我的修改爲_setup_page_id_to_num添加了一個參數,整數「level」默認爲1.每次調用都會遞增關卡。我們不是隻在結果中存儲頁碼,而是存儲一對頁碼和級別。使用返回的結果時應適用適當的修改。

我正在使用它來實現基於瀏覽器的「PDF Hacks」基於瀏覽器的頁面時間文檔查看器,其中包含反映LaTeX部分,子部分等書籤的側欄目錄表。我正在開發一個共享系統,其中pdftk無法安裝,但python可用。

0

這正是我正在尋找的。 Darrell對PdfFileReader的補充應該是PyPDF2的一部分。

我寫了一個小配方,使用PyPDF2和sejda控制檯通過書籤拆分PDF。在我的情況下,有幾個我想要保持在一起的第1級部分。這個腳本允許我這樣做,併爲結果文件提供有意義的名稱。

import operator 
import os 
import subprocess 
import sys 
import time 

import PyPDF2 as pyPdf 

# need to have sejda-console installed 
# change this to point to your installation 
sejda = 'C:\\sejda-console-1.0.0.M2\\bin\\sejda-console.bat' 

class Darrell(pyPdf.PdfFileReader): 
    ... 

if __name__ == '__main__': 
    t0= time.time() 

    # get the name of the file to split as a command line arg 
    pdfname = sys.argv[1] 

    # open up the pdf 
    pdf = Darrell(open(pdfname, 'rb')) 

    # build list of (pagenumbers, newFileNames) 
    splitlist = [(1,'FrontMatter')] # Customize name of first section 

    template = '%-5s %s' 
    print template % ('Page', 'Title') 
    print '-'*72 
    for t,p in sorted(pdf.getDestinationPageNumbers().iteritems(), 
         key=operator.itemgetter(1)): 

     # Customize this to get it to split where you want 
     if t.startswith('Chapter') or \ 
      t.startswith('Preface') or \ 
      t.startswith('References'): 

      print template % (p+1, t) 

      # this customizes how files are renamed 
      new = t.replace('Chapter ', 'Chapter')\ 
        .replace(': ', '-')\ 
        .replace(': ', '-')\ 
        .replace(' ', '_') 
      splitlist.append((p+1, new)) 

    # call sejda tools and split document 
    call = sejda 
    call += ' splitbypages' 
    call += ' -f "%s"'%pdfname 
    call += ' -o ./' 
    call += ' -n ' 
    call += ' '.join([str(p) for p,t in splitlist[1:]]) 
    print '\n', call 
    subprocess.call(call) 
    print '\nsejda-console has completed.\n\n' 

    # rename the split files 
    for p,t in splitlist: 
     old ='./%i_'%p + pdfname 
     new = './' + t + '.pdf' 
     print 'renaming "%s"\n  to "%s"...'%(old, new), 

     try: 
      os.remove(new) 
     except OSError: 
      pass 

     try: 
      os.rename(old, new) 
      print' succeeded.\n' 
     except: 
      print' failed.\n' 

    print '\ndone. Spliting took %.2f seconds'%(time.time() - t0) 
0

小更新@darrell類能夠解析UTF-8的輪廓,這是我的崗位作爲答案,因爲評論是難以閱讀。

問題是在pyPdf.pdf.Destination.title其可以兩種形式被返回:

  • pyPdf.generic.TextStringObject
  • pyPdf.generic.ByteStringObject

所以從_setup_outline_page_ids()函數輸出也兩種不同類型的返回title對象,該對象將失敗與UnicodeDecodeError如果大綱標題包含任何然後ASCII。

我加入這個代碼來解決這個問題:

if isinstance(title, pyPdf.generic.TextStringObject): 
    title = title.encode('utf-8') 
全班

class PdfOutline(pyPdf.PdfFileReader): 

    def getDestinationPageNumbers(self): 

     def _setup_outline_page_ids(outline, _result=None): 
      if _result is None: 
       _result = {} 
      for obj in outline: 
       if isinstance(obj, pyPdf.pdf.Destination): 
        _result[(id(obj), obj.title)] = obj.page.idnum 
       elif isinstance(obj, list): 
        _setup_outline_page_ids(obj, _result) 
      return _result 

     def _setup_page_id_to_num(pages=None, _result=None, _num_pages=None): 
      if _result is None: 
       _result = {} 
      if pages is None: 
       _num_pages = [] 
       pages = self.trailer["/Root"].getObject()["/Pages"].getObject() 
      t = pages["/Type"] 
      if t == "/Pages": 
       for page in pages["/Kids"]: 
        _result[page.idnum] = len(_num_pages) 
        _setup_page_id_to_num(page.getObject(), _result, _num_pages) 
      elif t == "/Page": 
       _num_pages.append(1) 
      return _result 

     outline_page_ids = _setup_outline_page_ids(self.getOutlines()) 
     page_id_to_page_numbers = _setup_page_id_to_num() 

     result = {} 
     for (_, title), page_idnum in outline_page_ids.iteritems(): 
      if isinstance(title, pyPdf.generic.TextStringObject): 
       title = title.encode('utf-8') 
      result[title] = page_id_to_page_numbers.get(page_idnum, '???') 
     return result