2014-10-28

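PdfTextExtractor.GetTextFromPage is not returning the correct text

Using iTextSharp, I have the following code, which successfully extracts the text from the vast majority of the PDFs I want to read: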

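// Requires: using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
// "text" is a string declared elsewhere; "fileName" is the path of the PDF.
PdfReader reader = new PdfReader(fileName);
// Page numbers are 1-based.
for (int i = 1; i <= reader.NumberOfPages; i++)
{
    text += PdfTextExtractor.GetTextFromPage(reader, i);
}
reader.Close();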

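However, some of my PDFs contain XFA forms (which have already been filled in), and for these the "text" field is populated with the following garbage: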

"Please wait... \n \nIf this message is not eventually replaced by the proper contents of the document, your PDF \nviewer may not be able to display this type of document. \n \nYou can upgrade to the latest version of Adobe Reader for Windows®, Mac, or Linux® by \nvisiting http://www.adobe.com/products/acrobat/readstep2.html. \n \nFor more assistance with Adobe Reader visit http://www.adobe.com/support/products/\nacrreader.html. \n \nWindows is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries. Mac is a trademark \nof Apple Inc., registered in the United States and other countries. Linux is the registered trademark of Linus Torvalds in the U.S. and other \ncountries." 

How can I fix this? I tried flattening the PDF with iTextSharp's PdfStamper [1], but that did not work: the resulting stream contains the same garbage text.

[1] How to flatten already filled out PDF form using iTextSharp

Answer

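You are dealing with a PDF that acts as a container for an XML stream. This XML stream is based on the XML Forms Architecture (XFA). The message you see is not garbage! It is the message contained in the PDF page, and it is shown whenever the document is opened in a viewer that reads the file as if it were an ordinary PDF.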

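For example: if you open the document in Apple Preview, you will see exactly the same message, because Apple Preview cannot render XFA forms. It should not surprise you that you get this message when you use iText to parse the PDF contained in your file. That is exactly the PDF content that is present in your file. The content you see when you open the document in Adobe Reader is not stored in PDF syntax; it is stored as an XML stream.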
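To illustrate what "stored as an XML stream" means in practice, here is a minimal sketch that reads field values out of an XML fragment using only the Java standard library. The sample XML, the class name `XfaDataSketch`, and the field names are made up for illustration; a real XFA form has its own structure, and obtaining the XML from the PDF is not shown here.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XfaDataSketch {

    // Hypothetical sample: a made-up fragment shaped like an XFA data packet.
    public static final String SAMPLE_XML =
        "<xfa:datasets xmlns:xfa=\"http://www.xfa.org/schema/xfa-data/1.0/\">"
      + "<xfa:data><form1><name>Jane Doe</name><city>Ghent</city></form1></xfa:data>"
      + "</xfa:datasets>";

    /**
     * Collects the text of every element that has no child elements,
     * keyed by the element's local name. Repeated names overwrite each
     * other, which is acceptable for a sketch but not for real forms.
     */
    public static Map<String, String> leafValues(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(xml)));
            Map<String, String> values = new LinkedHashMap<>();
            collect(doc.getDocumentElement(), values);
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Depth-first walk; records an element only when it is a leaf.
    private static void collect(Element element, Map<String, String> values) {
        boolean hasChildElement = false;
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElement = true;
                collect((Element) child, values);
            }
        }
        if (!hasChildElement) {
            values.put(element.getLocalName(), element.getTextContent().trim());
        }
    }

    public static void main(String[] args) {
        leafValues(SAMPLE_XML).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The point of the sketch is only that the filled-in values live in XML elements rather than in page content, which is why a page-based text extractor never sees them.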

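You say you have tried flattening the PDF as described in the answer to the question How to flatten already filled out PDF form using iTextSharp. However, that question is about flattening forms based on AcroForm technology. It is not supposed to work with XFA forms. If you want to flatten an XFA form, you need to use XFA Worker on top of iText: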

[JAVA]

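// dest: path of the flattened output file (defined elsewhere)
// baos: ByteArrayOutputStream holding the XFA PDF (defined elsewhere)
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(dest));
XFAFlattener xfaf = new XFAFlattener(document, writer);
xfaf.flatten(new PdfReader(baos.toByteArray()));
document.close();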

[C#]

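// dest: path of the flattened output file (defined elsewhere)
// ms: stream holding the XFA PDF (defined elsewhere)
Document document = new Document();
PdfWriter writer = PdfWriter.GetInstance(document, new FileStream(dest, FileMode.Create));
XFAFlattener xfaf = new XFAFlattener(document, writer);
// Rewind the stream so the reader starts at the beginning of the PDF.
ms.Position = 0;
xfaf.Flatten(new PdfReader(ms));
document.Close();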

The result of this flattening process is an ordinary PDF that can be parsed by your original code.
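If only some of your files are XFA forms, you may want to detect which ones need flattening before extracting text. The following is a rough heuristic sketch, not an iText feature: it checks whether the extracted text consists of the placeholder page quoted in the question. The class name `XfaPlaceholderCheck` and the marker phrases are assumptions chosen for illustration, and other files may carry a differently worded placeholder.

```java
import java.util.Locale;

public class XfaPlaceholderCheck {

    // Phrases taken from the placeholder page quoted in the question (lowercased).
    private static final String[] MARKERS = {
        "please wait...",
        "if this message is not eventually replaced by the proper contents of the document"
    };

    /** Returns true if the extracted text looks like the XFA placeholder page. */
    public static boolean looksLikeXfaPlaceholder(String extractedText) {
        if (extractedText == null) {
            return false;
        }
        // Collapse whitespace, because the extractor breaks lines mid-sentence.
        String normalized = extractedText.replaceAll("\\s+", " ")
                                         .trim()
                                         .toLowerCase(Locale.ROOT);
        // Require every marker, to reduce false positives on ordinary documents.
        for (String marker : MARKERS) {
            if (!normalized.contains(marker)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String placeholder = "Please wait... \n \nIf this message is not eventually "
            + "replaced by the proper contents of the document, your PDF \nviewer may "
            + "not be able to display this type of document. ";
        System.out.println(looksLikeXfaPlaceholder(placeholder));
        System.out.println(looksLikeXfaPlaceholder("Invoice total due: 42.00"));
    }
}
```

Files for which the check returns true could be sent through the flattening step first; all other files can go straight to the original extraction loop.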