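Perl with CAM::PDF: unable to extract images from a PDF

I have a PDF file for which listimages.pl (which uses CAM::PDF) returns nothing, while PDF::GetImages does extract the image. With the code below I can find the image objects, but I don't know how to extract them to a file. I also can't figure out why the command-line tool doesn't work.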

#!/usr/bin/perl -w 
use strict; 

use Cwd; 
use File::Basename; 
use Data::Dumper; 
use CAM::PDF; 
use CAM::PDF::PageText; 
use CAM::PDF::Renderer::Images; 

my $file = shift @ARGV || die "Usage: get-pdf-images /path/to/file.pdf \n"; 

my $pdf = CAM::PDF->new($file) || die "$CAM::PDF::errstr\n"; 

#print $pdf->toString(); 

foreach my $p (1 .. $pdf->numPages()) { 
    my $page = $pdf->getPageContentTree($p); 
    my $str = $pdf->getPageText($p); 
    if (defined $str) {
        # CAM::PDF->asciify(\$str);
        print $str;
    }

    print "-------------------------------\n"; 
    my $gs = $page->findImages(); 
    my @imageNodes = @{$gs->{images}}; 
    print "Found " . scalar @imageNodes . " images on page $p\n"; 
    print Data::Dumper->Dump([\@imageNodes],['imageNodes']); 
}

If I run `pdfinfo.pl`, it reports:

$ pdfinfo.pl test.pdf 
File:   test.pdf 
File Size: 4599 bytes 
Pages:  1 
Author:  þÿadmin01 
CreationDate: Fri Jan 3 03:48:53 2014 
Creator:  þÿPDFCreator Version 1.7.2 
Keywords: 
ModDate:  Fri Jan 3 03:48:53 2014 
Producer:  GPL Ghostscript 9.10 
Subject: 
Title:  þÿVision6Card 
Page Size: variable 
Optimized: no 
PDF version: 1.4 
Security 
    Passwd:  none 
    Print:  yes 
    Modify:  yes 
    Copy:  yes 
    Add:  yes

The test.pdf file can be downloaded here: http://imaptools.com:8080/dl/test.pdf

Source

2014-01-16 Stephen Woodbridge

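The image in question is a 3x10 pixel image that is encoded as an inline image. Maybe listimages.pl only recognizes XObject images? Adobe Acrobat Preflight, when analyzing the internal PDF structure, shows "PDFEngine Error: Severity: 4, System: 0, Error: 3" on this image. So maybe the image embedding is broken and that is why listimages.pl doesn't find it? Furthermore, I cannot see that image when the PDF is displayed. Maybe listimages.pl only extracts visible images? – mkl

I also get errors from http://www.pdf-tools.com/pdf/validate-pdfa-online.aspx, but I don't think that is the problem, because both PDF::GetImages and the command-line tool pdfimages extract the image successfully. I am using CAM::PDF to extract other information and would like to use it to extract the images as well.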

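Some parts of CAM::PDF are not finished. If you look at the source of listimages.pl, you'll see that its content parsing for inline images is rather primitive: for example, it cannot cope with unmatched content between BI and EI (which is the case here), so the picture is not found. There is also uninlinepdfimages.pl, which uses a different heuristic to parse inline images, but it appears to hang on this file, and I didn't bother to look into what confuses it. CAM::PDF::Renderer::Images, which you use in your code, is yet another take on the same problem. It does finally parse the content stream properly, but the library doesn't appear to provide any means of extracting the image data from there. However, if you really need it, I see no technical obstacle (apart from your time) to extracting the images programmatically from the information provided in @imageNodes (width, height, depth, compression used, image data).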
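To make the inline-image case concrete, here is a minimal sketch of a heuristic scan for `BI ... ID ... EI` sequences in a raw content stream. It is not how listimages.pl or CAM::PDF::Renderer::Images work internally; `find_inline_images` and `decode_ascii_hex` are hypothetical helpers written for illustration, and the sample content stream is made up. For a real file, the string could presumably be obtained with `$pdf->getPageContent($p)`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): scan a raw content stream for
# inline images of the form "BI <dict> ID <data> EI".
# Heuristic only: binary image data that happens to contain a
# whitespace-delimited "EI" would end the match too early.
sub find_inline_images {
    my ($content) = @_;
    my @images;
    while ($content =~ /(?<!\S)BI\s(.*?)\sID[ \t\r\n](.*?)\sEI(?!\S)/gs) {
        my ($dict_src, $data) = ($1, $2);
        my %dict;
        # Collect "/Key value" pairs; a value is an array, a name, or a bare token.
        while ($dict_src =~ m{/(\w+)\s*(\[[^\]]*\]|/[^\s/\[\]]+|[^\s/\[\]]+)}g) {
            $dict{$1} = $2;
        }
        push @images, { dict => \%dict, data => $data };
    }
    return @images;
}

# Hypothetical helper: decode ASCIIHex data (filter /AHx or /ASCIIHexDecode).
sub decode_ascii_hex {
    my ($hex) = @_;
    $hex =~ s/>.*//s;                 # '>' marks the end of the data
    $hex =~ s/\s+//g;                 # whitespace is insignificant
    $hex .= '0' if length($hex) % 2;  # an odd final digit is padded with 0
    return pack 'H*', $hex;
}

# Made-up sample: a 3x1 grayscale inline image, ASCIIHex-encoded.
my $sample = "q 3 0 0 1 0 0 cm\n"
           . "BI /W 3 /H 1 /CS /G /BPC 8 /F /AHx ID\n"
           . "00ff80>\n"
           . "EI\nQ\n";

for my $img (find_inline_images($sample)) {
    my $d      = $img->{dict};
    my $is_hex = ($d->{F} // '') =~ m{^/(?:AHx|ASCIIHexDecode)$};
    my $bytes  = $is_hex ? decode_ascii_hex($img->{data}) : $img->{data};
    printf "inline image %sx%s, %s bits/component, %d bytes\n",
        $d->{W}, $d->{H}, $d->{BPC}, length $bytes;
}
```

The decoded bytes are raw samples, not a finished image file. Writing them out as, say, a PNM file would still require handling the color space and any other filters, which this sketch leaves out.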

Source

2014-01-16 19:12:29 user2846289

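Agreed. I'm the author of CAM-PDF. When I first wrote it (back in 2002), I was trying to accomplish some very specific goals, and I added features as I needed them. Many of the higher-level tools (like listimages.pl and pdftotext.pl) are just heuristics and don't even try to cover all possibilities.

Thanks for all the feedback and suggestions. It turns out that the 3x10 image in this example is not what I was after. So I ended up using CAM::PDF to extract the text I need and then ImageMagick to render the PDF to JPG. I'm new to manipulating PDFs and I learned a lot. Thanks!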
