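Using iTextSharp to parse a PDF document - missing flattened form field values

I am creating PDF documents by using a template and filling out its form fields. I then flatten the PDF to prevent changes to it. I now need to parse the PDF and get the data from the form fields; however, when I parse the PDF, the text where the form fields were is missing. It appears that I cannot reference the fields because the PDF is flattened, and parsing the PDF skips the text in them and returns

First Name: Last Name:

but the PDF actually has

First Name: Jane Last Name: Doe

How can I get at the text that used to be in the form fields?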

UPDATE

' Requires: Imports System.IO, System.Text, iTextSharp.text.pdf, iTextSharp.text.pdf.parser

' Collects the text of every page in content-stream order
Dim text As StringBuilder = New StringBuilder()

If File.Exists(filename) Then
    Dim pdfReader As New PdfReader(filename)

    For page As Integer = 1 To pdfReader.NumberOfPages
        Dim strategy As ITextExtractionStrategy = New SimpleTextExtractionStrategy()
        Dim currentText As String = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy)

        ' Round-trip through the system default code page (characters outside it may be lost)
        currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)))
        text.Append(currentText)
    Next

    pdfReader.Close()

    textBox1.Text = text.ToString()
    textBox1.SelectionStart = 0
End If

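I can't post the original files because of the information in them, but I can post 2 sample files that illustrate what I am doing.

I use a template PDF like this... fw4.pdf

I then fill it with data and flatten it, so it looks like this... final_fw4.pdf

When I parse it using the code above I get this... parsed_pdf_text.txt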
view the files

None of the data is in the parsed text!

Source

2013-06-27 R.Keith

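Please show how you do the text parsing, in particular which text extraction strategy you are using. Please also provide a sample PDF file. – mkl

Files and more details have been added! –

I see you use the simple text extraction strategy. Have you also tried the location text extraction strategy? The simple one assumes that the content stream is already in the correct reading order, which it most certainly is not for a flattened form. – mkl

The analysis in your question is not correct:

however, when I parse the PDF, the text where the form fields were is missing

No, the text is not missing. It merely is not where you expect it. If you search parsed_pdf_text.txt for "Ja", you will find the flattened entries all together in one block: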

Ja 
Ja 
Ja 
8 
0 
1 
16 
28 
Jane Doe 532 12 1234 
100 North Cujo Street 
Nome, AK 67201 
4 4 9 
10 
11 
Walmart, Nome, AK 
WAL666 AB 4321

The reason is, as already indicated in the comments on your question, that you use the SimpleTextExtractionStrategy:

Dim strategy As ITextExtractionStrategy = New SimpleTextExtractionStrategy() 
Dim currentText As String = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy)

Have a look at its class comment:

* This renderer keeps track of the current Y position of each string. If it detects 
* that the y position has changed, it inserts a line break into the output. If the 
* PDF renders text in a non-top-to-bottom fashion, this will result in the text not 
* being a true representation of how it appears in the PDF. 
* 
* This renderer also uses a simple strategy based on the font metrics to determine if 
* a blank space should be inserted into the output.

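Flattening adds the form information to the page content at the end of the content stream; therefore, that text appears at the end of the extracted page text.

You might want to use the LocationTextExtractionStrategy instead. Its class comment says: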

* A text extraction renderer that keeps track of relative position of text on page 
* The resultant text will be relatively consistent with the physical layout that most 
* PDF files have on screen. 
* <br> 
* This renderer keeps track of the orientation and distance (both perpendicular 
* and parallel) to the unit vector of the orientation. Text is ordered by 
* orientation, then perpendicular, then parallel distance. Text with the same 
* perpendicular distance, but different parallel distance is treated as being on 
* the same line. 
* <br> 
* This renderer also uses a simple strategy based on the font metrics to determine if 
* a blank space should be inserted into the output.

This still is not optimal, but it is likely to work better in your case.
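As a minimal sketch of that change (assuming iTextSharp 5.x; the module and function names are only illustrative, not part of the library), the extraction loop from the question could be switched to the location-based strategy roughly like this:

```vbnet
Imports System.Text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Module LocationExtractionSketch
    ' Extracts the text of every page, ordered by position on the page
    ' rather than by the order of the operators in the content stream.
    Function ExtractTextByLocation(filename As String) As String
        Dim result As New StringBuilder()
        Dim reader As New PdfReader(filename)
        Try
            For page As Integer = 1 To reader.NumberOfPages
                ' A fresh strategy per page keeps chunks of different pages apart.
                Dim strategy As ITextExtractionStrategy = New LocationTextExtractionStrategy()
                result.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy))
            Next
        Finally
            ' Release the file even if extraction throws.
            reader.Close()
        End Try
        Return result.ToString()
    End Function
End Module
```

With this strategy the flattened values should end up near their labels instead of in one block at the end, although the spacing and line grouping are still heuristic.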

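I now need to parse the PDF and get the data from the form fields

If you only have a limited number of forms, you can look up the positions of the original form fields and parse only the text at those positions. In that case it may make sense to use a FilteredRenderListener in combination with a RegionTextRenderFilter.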
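A hedged sketch of that idea, assuming iTextSharp 5.x: it uses FilteredTextRenderListener (the text-extraction variant of the filtered listener) wrapped around a LocationTextExtractionStrategy. The module name, function name, and parameters are illustrative only, and the rectangle has to be supplied by the caller.

```vbnet
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Module RegionExtractionSketch
    ' Returns only the text on one page that the region filter lets through.
    ' x and y are the lower-left corner of the region in default PDF user
    ' space units; width and height extend to the right and upwards.
    Function ExtractTextInRegion(reader As PdfReader, pageNumber As Integer,
                                 x As Single, y As Single,
                                 width As Single, height As Single) As String
        Dim region As New System.util.RectangleJ(x, y, width, height)
        Dim filter As RenderFilter = New RegionTextRenderFilter(region)

        ' Text rejected by the filter never reaches the wrapped strategy.
        Dim strategy As ITextExtractionStrategy =
            New FilteredTextRenderListener(New LocationTextExtractionStrategy(), filter)

        Return PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy)
    End Function
End Module
```

The rectangles would come from the unflattened template; in iTextSharp 5.x, `AcroFields.GetFieldPositions` on the template is one way to obtain them. Calling the function once per field then gives one value per field, provided the flattened text really lies inside the original field rectangle.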

Source

2013-06-28 22:26:18 mkl

A JavaScript action could set the text when the page loads. Either way, I would love to see a file.

Source

2013-06-27 16:31:09

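Files and more details have been added! –

This is speculation, since I have no file to look at.

If by flattening you mean "putting the form data into the content", then the data is probably no longer available in any easily accessible way. Data on a form is represented by widget annotations. To flatten a form, you take the appearance of each widget annotation (or create one), append PDF code that renders the form field to the page content stream, and finally delete the annotation.

Here is what I see in the file: the first page has several content streams. The last content stream contains this excerpt: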

Q q Q q 1 0 0 1 501.46 481.92 cm /Xi0 Do Q q Q q 1 0 0 1 500.87 457.9 cm /Xi1 Do Q q Q

which is (more or less):

grestore 
gsave 
grestore 
gsave 
    translate(501.46, 481.92) 
    XObject("Xi0") 
grestore 
gsave 
grestore 
gsave 
    translate(500.87, 457.9) 
    XObject("Xi1") 
grestore 
gsave 
grestore

Xi0 is object #1 in the file, a form XObject with the following content stream:

q Q /Tx BMC q 0 0 26.03 12.33 re W n q BT 1 0 0 1 8.01 2.93 Tm /HeBo 9 Tf 
1 0.59 0 0.11 k (Ja)Tj 0 g ET Q Q EMC

which is (more or less):
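Spelled out operator by operator, in the same informal notation as the previous listing (the operation names are descriptive labels, not a real API), the stream reads roughly as follows:

```text
gsave
grestore
beginMarkedContent("Tx")
gsave
    rectangle(0, 0, 26.03, 12.33)
    clip
    endPath
    gsave
        beginText
            setTextMatrix(1, 0, 0, 1, 8.01, 2.93)
            setFont("HeBo", 9)
            setCMYKFillColor(1, 0.59, 0, 0.11)
            showText("Ja")
            setGrayFill(0)
        endText
    grestore
grestore
endMarkedContent
```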

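Your text is in there, and it is doing what I surmised. The more interesting question is, "why don't I see it when I extract text using iTextSharp?" I don't know, because I have not worked on iTextSharp. I did work on Adobe Acrobat, though, and among other things on the text extraction engine used for search in Acrobat 1.0, so I know how challenging it is to extract text from PDF, and that because of these challenges most products get it wrong, or do it badly, or both. Most likely, iTextSharp iterates through the content stream and, for every text operator, aggregates the action and state (i.e., "place this text in this font with this color and rendering mode"), but probably does not recurse into XObjects, and therefore completely misses everything created by flattening the form.

The short answer is that this is most likely a bug in iTextSharp, and it is worth reporting to them.

Normally I would point you to my company's tools, but at present I don't have the "flatten" functionality you need. Yet.

If I were you, I would take the approach of writing code to do the flattening myself. In effect, you iterate over the widget annotations and, instead of writing their appearance streams into the page content, you write the actual PDF content.

Also, the PDF output could be better. There is no reason for the empty, redundant gsave/grestore pairs, nor should there be a pointless color change. Fortunately, these things are benign.

Source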

2013-06-27 16:53:50 plinth

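Flattened as in 'pdfStamper.FormFlattening = True' –

Files and more details have been added! –

*but probably does not recurse into XObjects* - It has been recursing into them since version 5.0.1, and that was already years ago. More likely it is the choice of extraction strategy. The simple one is not enough here. – mkl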
