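How can I get the font names of the text in a PDF?

I want to extract all the distinct font names used for text in a PDF file. I am using the iTextSharp DLL, and my code is given below. How can I get the font names of the text in a PDF?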

using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;

namespace GetFontName
{
    class Program
    {
        static void Main(string[] args)
        {
            PdfReader reader = new PdfReader("C:/Users/agnihotri/Downloads/Test.pdf");
            HashSet<string> names = new HashSet<string>();
            for (int p = 1; p <= reader.NumberOfPages; p++)
            {
                PdfDictionary page = reader.GetPageN(p);
                PdfDictionary resources = page.GetAsDict(PdfName.RESOURCES);
                if (resources == null)
                    continue;

                // Get the page's font dictionary
                PdfDictionary fonts = resources.GetAsDict(PdfName.FONT);
                if (fonts == null)
                    continue;

                foreach (PdfName key in fonts.Keys)
                {
                    PdfDictionary font = fonts.GetAsDict(key);
                    // PdfName.ToString() includes the leading slash, e.g. "/BAAAAA+LiberationMono"
                    string name = font.GetAsName(PdfName.BASEFONT).ToString();

                    // Check for the prefix of a subsetted font
                    if (name.Length > 8 && name[7] == '+')
                    {
                        name = String.Format("%s subset (%s)", name.Substring(8), name.Substring(1, 7));
                    }
                    else
                    {
                        // Get the type of fully embedded fonts
                        name = name.Substring(1);
                        PdfDictionary desc = font.GetAsDict(PdfName.FONTDESCRIPTOR);
                        if (desc == null)
                            name += "no font descriptor";
                        else if (desc.Get(PdfName.FONTFILE) != null)
                            name += "(Type1) embedded";
                        else if (desc.Get(PdfName.FONTFILE2) != null)
                            name += "(TrueType) embedded ";
                        else if (desc.Get(PdfName.FONTFILE3) != null)
                            name += "(" + font.GetAsName(PdfName.SUBTYPE).ToString().Substring(1) + ") embedded";
                    }

                    names.Add(name);
                }
            }

            // Print each distinct font description once
            foreach (string fname in names)
            {
                Console.WriteLine(fname);
            }
            Console.Read();
        }
    }
}

The output I get is "Glyphless Font no font descriptor" for every PDF file I use as input. A link to the input file is below:

https://drive.google.com/open?id=0B6tD8gqVZtLiM3NYMmVVVllNcWc

Source

2016-06-14 Rahul Agnihotri

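PdfReader reader = new PdfReader("C:/Users/agnihotri/Downloads/Test.pdf"); - Double-check the path to the file; that may be the problem, because the code looks fine. I also strongly recommend adding some debugging when you copy and paste code from the internet, to see whether it really works.

I opened your PDF in Adobe Acrobat and looked at the Fonts panel. This is what I see:

You have an embedded subset of LiberationMono. This means that the font name is stored in the file as ABCDEF+LiberationMono (where ABCDEF is a series of 6 random but unique characters), because the font is subsetted. See What are the extra characters in the font name of my PDF?

Now let's open the same file in iText RUPS:

We find the /Font object, and it has a /FontDescriptor. In the /FontDescriptor, we find the /FontName in the format we expected: BAAAAA+LiberationMono.

Now that you know where to look for the name, you can adapt your code.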
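As a rough illustration of that hint, the sketch below reads /FontName from each font's /FontDescriptor instead of relying on /BaseFont. It assumes iTextSharp 5.x is referenced; the class name FontDescriptorNames is hypothetical and the file path is simply the one from the question. Composite (Type0) fonts keep their descriptor on a descendant font, which this sketch does not follow.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;

// Hypothetical helper class, not part of iTextSharp.
class FontDescriptorNames
{
    static void Main(string[] args)
    {
        // Path taken from the question; pass another file as the first argument.
        string path = args.Length > 0 ? args[0] : @"C:\Users\agnihotri\Downloads\Test.pdf";
        PdfReader reader = new PdfReader(path);
        HashSet<string> names = new HashSet<string>();
        try
        {
            for (int p = 1; p <= reader.NumberOfPages; p++)
            {
                PdfDictionary resources = reader.GetPageN(p).GetAsDict(PdfName.RESOURCES);
                PdfDictionary fonts = resources == null ? null : resources.GetAsDict(PdfName.FONT);
                if (fonts == null)
                    continue;

                foreach (PdfName key in fonts.Keys)
                {
                    PdfDictionary font = fonts.GetAsDict(key);
                    if (font == null)
                        continue;

                    // /FontDescriptor -> /FontName, e.g. /BAAAAA+LiberationMono
                    PdfDictionary desc = font.GetAsDict(PdfName.FONTDESCRIPTOR);
                    PdfName fontName = desc == null ? null : desc.GetAsName(PdfName.FONTNAME);
                    if (fontName != null)
                        names.Add(fontName.ToString().Substring(1)); // drop the leading slash
                    else
                        names.Add(key.ToString() + ": no /FontName found");
                }
            }
        }
        finally
        {
            reader.Close();
        }

        foreach (string name in names)
            Console.WriteLine(name);
    }
}
```

The names are printed with the subset prefix still attached, so the prefix can be stripped or reported separately as needed.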

Source

2016-06-14 14:13:23

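Thanks for the clarification... Would you mind helping me with the code? I am just a newbie to coding and C#.

@Rahul, don't give up at this early stage! Once you have a hint like this, try to apply it; that is very good practice. – halfer

Not sure whether I am on the right track... following the hint: font.GetAsDict(PdfName.FontDescriptor.FontName); if (desc == null) name += "no font descriptor"; else if (desc.Get(PdfName.FontName) != null) name += "(Type1) embedded"; else if (desc.Get(PdfName.FontName) != null) name += "(TrueType) embedded"; else if (desc.Get(PdfName.FontName) != null)

Running your code with minimal changes, I get the output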

%s subset (%s)

Actually, %s looks like a Java format string rather than a .NET format string. Using the more .NET'ish format string {0} subset ({1}), I get

LiberationMono subset (BAAAAA+)
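The trailing + in that output comes from a second Java-versus-.NET difference: Java's substring(1, 7) takes an end index, while C#'s Substring(1, 7) takes a length, so it returns seven characters including the plus sign. A minimal, self-contained sketch of the corrected formatting follows; the class and method names (FontNameFormatter, Describe) are hypothetical, and it assumes the input is the string form of a PDF name with a leading slash.

```csharp
using System;

// Hypothetical helper, shown only to illustrate the format-string fix.
static class FontNameFormatter
{
    // Turns a raw name such as "/BAAAAA+LiberationMono" into a readable description.
    public static string Describe(string rawName)
    {
        // Drop the leading slash that PdfName.ToString() includes.
        string name = rawName.StartsWith("/") ? rawName.Substring(1) : rawName;

        // A subset tag is six characters followed by '+'.
        if (name.Length > 7 && name[6] == '+')
        {
            string tag = name.Substring(0, 6);   // start index 0, length 6
            string font = name.Substring(7);     // everything after the '+'
            return String.Format("{0} subset ({1})", font, tag);
        }
        return name;
    }

    static void Main()
    {
        Console.WriteLine(Describe("/BAAAAA+LiberationMono")); // LiberationMono subset (BAAAAA)
        Console.WriteLine(Describe("/Helvetica"));             // Helvetica
    }
}
```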

I would recommend using backslashes and the @"..." string form for file paths instead of forward slashes, like this

PdfReader reader = new PdfReader(@"C:\Users\agnihotri\Downloads\Test.pdf");

and double-checking the file name and path; after all, the file you provided is named Hello_World.pdf.

Source

2016-06-14 14:30:21 mkl

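Thank you all for the suggestions and help. I have been able to solve the problem without making any changes to the code. The only thing required was to use the iTextSharp 5.5.9 DLL; everything else was fine. This can be marked as closed.

@RahulAgnihotri *The only thing required was to use the iTextSharp 5.5.9 dll* - Hmm, since you did not mention the version you were using, you gave the impression that you had been using the current, most recent version all along... *This can be marked as closed* - You can do that yourself: create an answer stating the cause (along the lines of "used an old iTextSharp version; works fine with the current 5.5.9") and mark that answer as accepted (click the tick at its upper left). Marking your own answer as accepted may not be possible immediately, but it certainly will be after a few hours. – mkl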
