2016-12-15 27 views
1

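PowerShell / C#: get PDF character positions

Can someone tell me whether there is a way to get the X,Y coordinates of every character in a PDF? I understand the result may not be X/Y as such; I just need some way to determine where each text character sits on the page. The characters are not rasterized, so I don't need to recognize them (no OCR required). This is what I have started with.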

$Path = "C:\temp\test.pdf" 

$reader = New-Object iTextSharp.text.pdf.pdfreader -ArgumentList $Path 

for ($page = 1; $page -le $reader.NumberOfPages; $page++)
{
    $text = [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($reader, $page).Split([char]0x000A)
}

$reader.Close() 

Answers

1

I'm not familiar with PowerShell, but you can do this in C#. FYI, you need iTextSharp 5.5.10 or iText 7.0.1 for .NET to run this.

void Run()
{
    PdfReader reader = new PdfReader("/path/to/input.pdf");

    // Extracting the text drives the strategy below, which prints each character's position.
    var s = PdfTextExtractor.GetTextFromPage(reader, 1, new LocationTextExtractionStrategy(new Local()));

    reader.Close();
}

private class Local : LocationTextExtractionStrategy.ITextChunkLocationStrategy
{
    public LocationTextExtractionStrategy.ITextChunkLocation CreateLocation(TextRenderInfo renderInfo, LineSegment baseline)
    {
        // you need the info per character, so iterate all characters per TextRenderInfo
        foreach (TextRenderInfo tr in renderInfo.GetCharacterRenderInfos())
        {
            LineSegment bl = tr.GetBaseline();
            // do something with the info
            Console.WriteLine(tr.GetText() + " @ (" + bl.GetStartPoint()[Vector.I1] + ", " + bl.GetStartPoint()[Vector.I2] + ")");
        }
        return new LocationTextExtractionStrategy.TextChunkLocationDefaultImp(baseline.GetStartPoint(), baseline.GetEndPoint(), renderInfo.GetSingleSpaceWidth());
    }
}
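
If you would rather collect the positions than print them, a render listener is one possible alternative. The following is only a sketch against the iTextSharp 5.x parser API (`IRenderListener`, `PdfReaderContentParser`); the names `CharPosition`, `CharPositionCollector` and `CharPositionDemo` are hypothetical and made up for illustration. The coordinates are the start of each character's baseline in PDF user space, which typically has its origin at the lower-left corner of the page.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical value holder for this sketch; not part of iTextSharp.
public sealed class CharPosition
{
    public int Page;
    public string Text;
    public float X; // baseline start, PDF user-space units
    public float Y;
}

// Hypothetical listener that records one entry per character it is shown.
public sealed class CharPositionCollector : IRenderListener
{
    private readonly int page;
    public readonly List<CharPosition> Positions = new List<CharPosition>();

    public CharPositionCollector(int page) { this.page = page; }

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    public void RenderText(TextRenderInfo renderInfo)
    {
        // Split each text chunk into single characters, as in the answer above.
        foreach (TextRenderInfo ch in renderInfo.GetCharacterRenderInfos())
        {
            Vector start = ch.GetBaseline().GetStartPoint();
            Positions.Add(new CharPosition
            {
                Page = page,
                Text = ch.GetText(),
                X = start[Vector.I1],
                Y = start[Vector.I2]
            });
        }
    }
}

public static class CharPositionDemo
{
    // Returns every character position in the file, page by page.
    public static List<CharPosition> Collect(string filePath)
    {
        var all = new List<CharPosition>();
        PdfReader reader = new PdfReader(filePath);
        try
        {
            var parser = new PdfReaderContentParser(reader);
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                CharPositionCollector collector =
                    parser.ProcessContent(page, new CharPositionCollector(page));
                all.AddRange(collector.Positions);
            }
        }
        finally
        {
            reader.Close();
        }
        return all;
    }
}
```

Calling `CharPositionDemo.Collect("/path/to/input.pdf")` would then give you a list you can filter or sort, instead of console output. Characters arrive in content-stream order, which is not necessarily reading order.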
0

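Based on blagae's answer, here is a PowerShell script that essentially wraps his C# code. I didn't find an easy way to use LocationTextExtractionStrategy directly from PowerShell. You will need iTextSharp 5.5.10, because it is the first public release that exposes LocationTextExtractionStrategy.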

$Source = @"
    using System;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    public class PdfHelper
    {
        public static void Run(string filePath)
        {
            PdfReader reader = new PdfReader(filePath);
            try
            {
                for (var page = 1; page <= reader.NumberOfPages; page++)
                {
                    PdfTextExtractor.GetTextFromPage(reader, page, new LocationTextExtractionStrategy(new Local()));
                }
            }
            finally
            {
                // release the file handle even if extraction throws
                reader.Close();
            }
        }
    }

    class Local : LocationTextExtractionStrategy.ITextChunkLocationStrategy
    {
        public LocationTextExtractionStrategy.ITextChunkLocation CreateLocation(TextRenderInfo renderInfo, LineSegment baseline)
        {
            // you need the info per character, so iterate all characters per TextRenderInfo
            foreach (TextRenderInfo tr in renderInfo.GetCharacterRenderInfos())
            {
                LineSegment bl = tr.GetBaseline();
                // do something with the info
                Console.WriteLine(tr.GetText() + " @ (" + bl.GetStartPoint()[Vector.I1] + ", " + bl.GetStartPoint()[Vector.I2] + ")");
            }
            return new LocationTextExtractionStrategy.TextChunkLocationDefaultImp(baseline.GetStartPoint(), baseline.GetEndPoint(), renderInfo.GetSingleSpaceWidth());
        }
    }
"@

$DLLPath = "$PSScriptRoot\iTextSharp.dll" 
Add-Type -Path $DLLPath 
Add-Type -ReferencedAssemblies $DLLPath -TypeDefinition $Source -Language CSharp 

$Path = "C:\temp\test.pdf" 
[PdfHelper]::Run($Path) 
+0

Thank you for your help – James