2014-01-10 29 views
3

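Extracting text from highlighted annotations using poppler-qt4 / python-poppler-qt4

Since yesterday I have been trying to use python-poppler-qt4 to extract the text under highlight annotations in some PDFs.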

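According to this documentation, it looks like I have to get the text with the Page.text() method, passing it the rectangle returned by Annotation.boundary() for the highlight annotation. But I only get blank text. Can someone help me? My code is below, along with a link to the PDF I am using. Thanks for your help!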

import popplerqt4
import sys
import PyQt4


def main():

    doc = popplerqt4.Poppler.Document.load(sys.argv[1])
    total_annotations = 0
    for i in range(doc.numPages()):
        page = doc.page(i)
        annotations = page.annotations()
        if len(annotations) > 0:
            for annotation in annotations:
                if isinstance(annotation, popplerqt4.Poppler.Annotation):
                    total_annotations += 1
                    if(isinstance(annotation, popplerqt4.Poppler.HighlightAnnotation)):
                        print str(page.text(annotation.boundary()))
    if total_annotations > 0:
        print str(total_annotations) + " annotation(s) found"
    else:
        print "no annotations found"

if __name__ == "__main__":
    main()

Test PDF: https://www.dropbox.com/s/10plnj67k9xd1ot/test.pdf

Answer

5

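Looking at the documentation for Annotations, it seems that the boundary property returns the annotation's bounding rectangle in normalized coordinates, which is why passing it straight to Page.text() yields nothing. Although this seems a strange design decision, we can simply scale the coordinates by the page.pageSize().width() and page.pageSize().height() values.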
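As a minimal, library-free sketch of that scaling step: the helper name `scale_normalized_rect` and the 600 x 800 page size are made up for illustration and are not part of popplerqt4. It assumes the rectangle is given as (left, top, right, bottom) with every value between 0 and 1.

```python
def scale_normalized_rect(rect, page_width, page_height):
    """Scale a (left, top, right, bottom) rectangle from normalized
    0..1 coordinates to page units.

    Hypothetical helper for illustration only; with popplerqt4 the page
    size would come from page.pageSize().width() and .height().
    """
    left, top, right, bottom = rect
    return (left * page_width,
            top * page_height,
            right * page_width,
            bottom * page_height)


# Assumed example values: a highlight covering the middle of a 600 x 800 page.
scaled = scale_normalized_rect((0.25, 0.5, 0.75, 1.0), 600.0, 800.0)
print(scaled)
```

The resulting four values are what would then be handed to QRectF.setCoords() before calling Page.text().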

import popplerqt4
import sys
import PyQt4.QtCore  # QRectF lives in QtCore


def main():

    doc = popplerqt4.Poppler.Document.load(sys.argv[1])
    total_annotations = 0
    for i in range(doc.numPages()):
        #print("========= PAGE {} =========".format(i+1))
        page = doc.page(i)
        annotations = page.annotations()
        # Page size, used to scale the normalized quad coordinates
        (pwidth, pheight) = (page.pageSize().width(), page.pageSize().height())
        if len(annotations) > 0:
            for annotation in annotations:
                if isinstance(annotation, popplerqt4.Poppler.Annotation):
                    total_annotations += 1
                    if(isinstance(annotation, popplerqt4.Poppler.HighlightAnnotation)):
                        # One quad per highlighted region (typically one per line)
                        quads = annotation.highlightQuads()
                        txt = ""
                        for quad in quads:
                            # points[0] and points[2] are used as opposite corners
                            rect = (quad.points[0].x() * pwidth,
                                    quad.points[0].y() * pheight,
                                    quad.points[2].x() * pwidth,
                                    quad.points[2].y() * pheight)
                            bdy = PyQt4.QtCore.QRectF()
                            bdy.setCoords(*rect)
                            # unicode() rather than str(): the QString may hold non-ASCII text
                            txt = txt + unicode(page.text(bdy)) + ' '

                        #print("========= ANNOTATION =========")
                        print(unicode(txt))

    if total_annotations > 0:
        print str(total_annotations) + " annotation(s) found"
    else:
        print "no annotations found"

if __name__ == "__main__":
    main()

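Additionally, I decided to concatenate the text from each of the .highlightQuads() regions to get a better representation of what was actually highlighted.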

Please note the explicit <space> I have appended to the text of each quad region.
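The quad handling above takes two of the four corner points as opposite corners. If the point order cannot be relied on, one possible alternative (an assumption on my part, not something the script above does) is to take the minimum and maximum over all four corners. The sketch below uses plain (x, y) tuples and a made-up helper name, `quad_bounding_box`; with popplerqt4 the values would come from each point's .x() and .y().

```python
def quad_bounding_box(points):
    """Return (left, top, right, bottom) enclosing all corner points.

    `points` is a sequence of (x, y) tuples. Hypothetical helper for
    illustration; it does not depend on the order of the corners.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


# Assumed example: one highlighted line in normalized coordinates,
# with the corners listed in an arbitrary order.
corners = [(0.4, 0.3), (0.1, 0.2), (0.4, 0.2), (0.1, 0.3)]
print(quad_bounding_box(corners))
```

The result is still in normalized coordinates, so it would need to be scaled by the page width and height before being passed to Page.text().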

In the example document the returned QString could not be passed directly to print() or str(); the solution was to use unicode() instead.

I hope this helps someone as much as it helped me.

Note: page rotation may affect the scaling values; I have not been able to test this.

+0

Thanks, I struggled to install popplerqt4, but this worked like a charm! – magicrebirth