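There's a poorly named interface called ITextExtractionStrategy that you can implement, and it will give you extended information as you extract things from a PDF. I say "poorly named" because although it says "text", it also supports images. The interface has 5 methods, 4 of which are text-based and can be ignored. The method you are interested in is RenderImage. Below is a complete working implementation: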
    Public Class ImageInfoTextExtractionStrategy
        Implements iTextSharp.text.pdf.parser.ITextExtractionStrategy

    #Region " Extra Methods - Just Ignore "
        Public Sub BeginTextBlock() Implements iTextSharp.text.pdf.parser.IRenderListener.BeginTextBlock
        End Sub
        Public Sub EndTextBlock() Implements iTextSharp.text.pdf.parser.IRenderListener.EndTextBlock
        End Sub
        Public Sub RenderText(renderInfo As iTextSharp.text.pdf.parser.TextRenderInfo) Implements iTextSharp.text.pdf.parser.IRenderListener.RenderText
        End Sub
        Public Function GetResultantText() As String Implements iTextSharp.text.pdf.parser.ITextExtractionStrategy.GetResultantText
            Return Nothing
        End Function
    #End Region

        'We'll add all image rectangles to this collection
        Private _AllImageRectangles As New List(Of iTextSharp.text.Rectangle)
        Public ReadOnly Property AllImageRectangles As List(Of iTextSharp.text.Rectangle)
            Get
                Return Me._AllImageRectangles
            End Get
        End Property

        Public Sub RenderImage(renderInfo As iTextSharp.text.pdf.parser.ImageRenderInfo) Implements iTextSharp.text.pdf.parser.IRenderListener.RenderImage
            'Get the image's matrix
            Dim m = renderInfo.GetImageCTM()
            Dim w, h, x, y As Single
            'Get the various parameters from the matrix
            'NOTE: reading width and height straight from I11 and I22 assumes the image is not rotated or skewed
            w = m(iTextSharp.text.pdf.parser.Matrix.I11)
            h = m(iTextSharp.text.pdf.parser.Matrix.I22)
            x = m(iTextSharp.text.pdf.parser.Matrix.I31)
            y = m(iTextSharp.text.pdf.parser.Matrix.I32)
            'Turn the parameters into a rectangle
            Me._AllImageRectangles.Add(New iTextSharp.text.Rectangle(x, y, x + w, y + h))
        End Sub
    End Class
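One caveat about the matrix math: the width and height read from I11 and I22 are only meaningful when the image is drawn upright. A PDF image occupies the unit square in its own space, and the matrix (a b c d e f, i.e. I11, I12, I21, I22, I31, I32) maps that square onto the page. If the image might be rotated, one option is to transform all four corners and take their bounding box. Here is a rough, self-contained sketch of just that arithmetic. BoundsFromCtm and the module name are hypothetical helpers of mine, not part of iTextSharp, and the sample numbers are made up:

```vbnet
Imports System

Module ImageBoundsSketch
    'Hypothetical helper (not an iTextSharp API): returns {left, bottom, right, top}
    'for the unit square transformed by the matrix coefficients a, b, c, d, e, f.
    Function BoundsFromCtm(a As Single, b As Single, c As Single,
                           d As Single, e As Single, f As Single) As Single()
        'Corners (0,0), (1,0), (0,1), (1,1) mapped by x' = a*x + c*y + e, y' = b*x + d*y + f
        Dim x0 = e, x1 = a + e, x2 = c + e, x3 = a + c + e
        Dim y0 = f, y1 = b + f, y2 = d + f, y3 = b + d + f
        Dim left = Math.Min(Math.Min(x0, x1), Math.Min(x2, x3))
        Dim right = Math.Max(Math.Max(x0, x1), Math.Max(x2, x3))
        Dim bottom = Math.Min(Math.Min(y0, y1), Math.Min(y2, y3))
        Dim top = Math.Max(Math.Max(y0, y1), Math.Max(y2, y3))
        Return New Single() {left, bottom, right, top}
    End Function

    Sub Main()
        'Made-up example: a 200 x 100 image rotated 90 degrees counter-clockwise
        Dim bounds = BoundsFromCtm(0, 200, -100, 0, 150, 60)
        Console.WriteLine(String.Join(", ", bounds))
    End Sub
End Module
```

For this sample matrix both I11 and I22 are 0, so the simple approach would produce an empty rectangle, while the corner-based version should report 50, 60, 150, 260. Inside RenderImage you would pass in the six values read from the matrix, including I12 and I21.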
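To use this class, we pass it to the (once again poorly named) method iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(). Normally you would call this method and assign the string it returns to a variable, but in our case we don't care about the text, so we don't. To use it, you'd do this: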
    'Assumes Imports System.IO and Imports iTextSharp.text.pdf at the top of the file
    'Path to our pdf with images
    Dim PdfWithImage = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "PdfWithImage.pdf")
    'Bind a reader to our PDF
    Dim reader As New PdfReader(PdfWithImage)
    'Create an instance of our custom extraction class
    Dim strat As New ImageInfoTextExtractionStrategy()
    'Loop through each page in our PDF
    For I = 1 To reader.NumberOfPages
        'The GetTextFromPage method does the work even though we are working with images
        iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, I, strat)
    Next
    'Release the file once every page has been processed
    reader.Close()
    'Get all image rectangles found (from all pages, in the order they were encountered)
    Dim Rects = strat.AllImageRectangles
    For Each R In Rects
        'Do something with your rectangles here
    Next
I'm really sorry about my writing, I'm learning slowly. – XenKid 2012-03-21 19:00:28
I get it now, thank you very much, it works fine. – XenKid 2012-03-21 19:22:50