從帶有CID字體的PDF中提取文本

我正在編寫一個Web應用程序，該應用程序在PDF中的每個頁面的頂部提取一行。 PDF文件來自不同版本的產品，可以通過多種PDF打印機，也可以使用不同的版本和不同的設置。從帶有CID字體的PDF中提取文本

到目前爲止使用PDFSharp和iTextSharp我已經設法讓它適用於所有版本的PDF。我的掛斷文件有CID字體（Identity-H）。

我已經寫了一個部分解析器來查找字體表引用和文本塊，但將它們轉換爲可讀文本正在擊敗我。

有沒有人有： - 解析器（如這個https://stackoverflow.com/a/1732265/5169050），對付CID字體;或 - 如何分析一個網頁資源字典，找到網頁的字體，並得到其ToUnicode流，以幫助玩完這個例子中的一些示例代碼（https://stackoverflow.com/a/4048328/5169050）

我們必須使用iTextSharp的4.1保留了自由TO-使用許可證。

這是我的部分解析器。

public string ExtractTextFromCIDPDFBytes(byte[] input) 
{ 
    if (input == null || input.Length == 0) return ""; 

    try 
    { 
     // Holds the final result to be returned 
     string resultString = ""; 
     // Are we in a block of text or not 
     bool blnInText = false; 
     // Holds each line of text before written to resultString 
     string phrase = ""; 
     // Holds the 4-character hex codes as they are built 
     string hexCode = ""; 
     // Are we in a font reference or not (much like a code block) 
     bool blnInFontRef = false; 
     // Holds the last font reference and therefore the CMAP table 
     // to be used for any text found after it 
     string currentFontRef = ""; 

     for (int i = 0; i < input.Length; i++) 
     { 
      char c = (char)input[i]; 

      switch (c) 
      { 
       case '<': 
        { 
         blnInText = true; 
         break; 
        } 
       case '>': 
        { 
         resultString = resultString + Environment.NewLine + phrase; 
         phrase = ""; 
         blnInText = false; 
         break; 
        } 
       case 'T': 
        { 
         switch (((char)input[i + 1]).ToString().ToLower()) 
         { 
          case "f": 
           { 
            // Tf represents the start of a font table reference 
            blnInFontRef = true; 
            currentFontRef = ""; 
            break; 
           } 
          case "d": 
           { 
            // Td represents the end of a font table reference or 
            // the start of a text block 
            blnInFontRef = false; 
            break; 
           } 
         } 
         break; 
        } 
       default: 
        { 
         if (blnInText) 
         { 
          // We are looking for 4-character blocks of hex characters 
          // These will build up a number which refers to the index 
          // of the glyph in the CMAP table, which will give us the 
          // character 
          hexCode = hexCode + c; 
          if (hexCode.Length == 4) 
          { 
           // TODO - translate code to character 
           char translatedHexCode = c; 



           phrase = phrase + translatedHexCode; 
           // Blank it out ready for the next 4 
           hexCode = ""; 
          } 
         } 
         else 
         { 
          if (blnInFontRef) 
          { 
           currentFontRef = currentFontRef + c; 
          } 
         } 
         break; 
        } 
      } 
     } 

     return resultString; 
    } 
    catch 
    { 
     return ""; 
    } 
}

來源

2015-10-29 Red

軌跡從頁面資源到'/ ToUnicode'在Adobe的PDF參考描述。它基本上是查找鏈'/ Page'>'/ Resources'>'/ Font'>'/ ToUnicode'（大多數查找都是對象引用）。解析ToUnicode映射併爲您的字體創建一個新的cmap。 – usr2564301

*我們必須使用iTextSharp 4.1來保留免費使用的許可證。*您應該參加我的[JavaOne talk]（http://www.slideshare.net/iTextPDF/ianal-what-developers-should-知道關於ip和法律）yesteday。然後你就會明白你爲什麼說錯了。 –

謝謝Jongware，我現在可以看到我認爲需要的結構和項目。我努力想要得到任何東西的東西，但我覺得我錯過了一些基本的東西。從PdfPage調用頁面下面的每個工作在調試和給出結果，將執行沒有錯誤，但所有將返回'名稱'a'在當前上下文中不存在' PdfDictionary.DictionaryElements a = page。元素; PdfItem b = page.Elements [「/ Resources」]; PdfDictionary c =（PdfDictionary）頁面。元素[ 「/資源」]; 布魯諾，恥辱，我錯過了 – Red

它花了一段時間，但我終於有一些代碼來從Identity-H編碼PDF中讀取純文本。我在這裏發佈它來幫助別人，我知道會有辦法改進它。例如，我沒有觸及字符映射（beginbfchar），我的範圍實際上並不是範圍。我已經花了一個多星期的時間了，除非我們打到工作方式不同的文件，否則無法證明時間。抱歉。

用法：

PdfDocument inputDocument = PDFHelpers.Open(physcialFilePath, PdfDocumentOpenMode.Import) 
foreach (PdfPage page in inputDocument.Pages) 
{ 
    for (Int32 index = 0; index < page.Contents.Elements.Count; index++) 
    { 
     PdfDictionary.PdfStream stream = page.Contents.Elements.GetDictionary(index).Stream; 
     String outputText = new PDFParser().ExtractTextFromPDFBytes(stream.Value).Replace(" ", String.Empty); 

     if (outputText == "" || outputText.Replace("\n\r", "") == "") 
     { 
      // Identity-H encoded file 
      string[] hierarchy = new string[] { "/Resources", "/Font", "/F*" }; 
      List<PdfItem> fonts = PDFHelpers.FindObjects(hierarchy, page, true); 
      outputText = PDFHelpers.FromUnicode(stream, fonts); 
     } 
    } 
}

和實際的輔助類，其中我會後的全部，因爲他們都在本例中使用，是因爲我發現這麼幾個完整的例子，當我自己我試圖解決這個問題。助手使用PDFSharp和iTextSharp能夠在1.5之前和之後打開PDF，ExtractTextFromPDFBytes以標準PDF讀取，我的FindObjects（搜索文檔樹和返回對象）和FromUnicode獲取加密文本和一個字體集合來翻譯它。

using PdfSharp.Pdf; 
using PdfSharp.Pdf.Content; 
using PdfSharp.Pdf.Content.Objects; 
using System; 
using System.Collections.Generic; 
using System.IO; 

namespace PdfSharp.Pdf.IO 
{ 
    /// <summary> 
    /// uses itextsharp 4.1.6 to convert any pdf to 1.4 compatible pdf, called instead of PdfReader.open 
    /// </summary> 
    static public class PDFHelpers 
    { 
     /// <summary> 
     /// uses itextsharp 4.1.6 to convert any pdf to 1.4 compatible pdf, called instead of PdfReader.open 
     /// </summary> 
     static public PdfDocument Open(string PdfPath, PdfDocumentOpenMode openmode) 
     { 
      return Open(PdfPath, null, openmode); 
     } 
     /// <summary> 
     /// uses itextsharp 4.1.6 to convert any pdf to 1.4 compatible pdf, called instead of PdfReader.open 
     /// </summary> 
     static public PdfDocument Open(string PdfPath, string password, PdfDocumentOpenMode openmode) 
     { 
      using (FileStream fileStream = new FileStream(PdfPath, FileMode.Open, FileAccess.Read)) 
      { 
       int len = (int)fileStream.Length; 
       // TODO: Setting this byteArray causes the out of memory exception which is why we 
       // have the 70mb limit. Solve this and we can increase the file size limit 
       System.Diagnostics.Process proc = System.Diagnostics.Process.GetCurrentProcess(); 
       long availableMemory = proc.PrivateMemorySize64/1024/1024; //Mb of RAM allocated to this process that cannot be shared with other processes 
       if (availableMemory < (fileStream.Length/1024/1024)) 
       { 
        throw new Exception("The available memory " + availableMemory + "Mb is not enough to open, split and save a file of " + fileStream.Length/1024/1024); 
       } 

       try 
       { 
        Byte[] fileArray = new Byte[len]; 
        fileStream.Read(fileArray, 0, len); 
        fileStream.Close(); 
        fileStream.Dispose(); 


        PdfDocument result = Open(fileArray, openmode); 
        if (result.FullPath == "") 
        { 
         // The file was converted to a v1.4 document and only exists as a document in memory 
         // Save over the original file so other references to the file get the compatible version 
         // TODO: It would be good if we could do this conversion without opening every document another 2 times 
         PdfDocument tempResult = Open(fileArray, PdfDocumentOpenMode.Modify); 

         iTextSharp.text.pdf.BaseFont bfR = iTextSharp.text.pdf.BaseFont.CreateFont(Environment.GetEnvironmentVariable("SystemRoot") + "\\fonts\\arial.ttf", iTextSharp.text.pdf.BaseFont.IDENTITY_H, iTextSharp.text.pdf.BaseFont.EMBEDDED); 
         bfR.Subset = false; 

         tempResult.Save(PdfPath); 
         tempResult.Close(); 
         tempResult.Dispose(); 
         result = Open(fileArray, openmode); 
        } 

        return result; 
       } 
       catch (OutOfMemoryException) 
       { 
        fileStream.Close(); 
        fileStream.Dispose(); 

        throw; 
       } 
      } 
     } 

     /// <summary> 
     /// uses itextsharp 4.1.6 to convert any pdf to 1.4 compatible pdf, called instead of PdfReader.open 
     /// </summary> 
     static public PdfDocument Open(byte[] fileArray, PdfDocumentOpenMode openmode) 
     { 
      return Open(new MemoryStream(fileArray), null, openmode); 
     } 
     /// <summary> 
     /// uses itextsharp 4.1.6 to convert any pdf to 1.4 compatible pdf, called instead of PdfReader.open 
     /// </summary> 
     static public PdfDocument Open(byte[] fileArray, string password, PdfDocumentOpenMode openmode) 
     { 
      return Open(new MemoryStream(fileArray), password, openmode); 
     } 

     /// <summary> 
     /// uses itextsharp 4.1.6 to convert any pdf to 1.4 compatible pdf, called instead of PdfReader.open 
     /// </summary> 
     static public PdfDocument Open(MemoryStream sourceStream, PdfDocumentOpenMode openmode) 
     { 
      return Open(sourceStream, null, openmode); 
     } 
     /// <summary> 
     /// uses itextsharp 4.1.6 to convert any pdf to 1.4 compatible pdf, called instead of PdfReader.open 
     /// </summary> 
     static public PdfDocument Open(MemoryStream sourceStream, string password, PdfDocumentOpenMode openmode) 
     { 
      PdfDocument outDoc = null; 
      sourceStream.Position = 0; 

      try 
      { 
       outDoc = (password == null) ? 
         PdfReader.Open(sourceStream, openmode) : 
         PdfReader.Open(sourceStream, password, openmode); 

       sourceStream.Position = 0; 
       MemoryStream outputStream = new MemoryStream(); 
       iTextSharp.text.pdf.PdfReader reader = (password == null) ? 
         new iTextSharp.text.pdf.PdfReader(sourceStream) : 
         new iTextSharp.text.pdf.PdfReader(sourceStream, System.Text.ASCIIEncoding.ASCII.GetBytes(password)); 
       System.Collections.ArrayList fontList = iTextSharp.text.pdf.BaseFont.GetDocumentFonts(reader, 1); 
      } 
      catch (PdfSharp.Pdf.IO.PdfReaderException) 
      { 
       //workaround if pdfsharp doesn't support this pdf 
       sourceStream.Position = 0; 
       MemoryStream outputStream = new MemoryStream(); 
       iTextSharp.text.pdf.PdfReader reader = (password == null) ? 
         new iTextSharp.text.pdf.PdfReader(sourceStream) : 
         new iTextSharp.text.pdf.PdfReader(sourceStream, System.Text.ASCIIEncoding.ASCII.GetBytes(password)); 
       iTextSharp.text.pdf.PdfStamper pdfStamper = new iTextSharp.text.pdf.PdfStamper(reader, outputStream); 
       pdfStamper.FormFlattening = true; 
       pdfStamper.Writer.SetPdfVersion(iTextSharp.text.pdf.PdfWriter.PDF_VERSION_1_4); 
       pdfStamper.Writer.CloseStream = false; 
       pdfStamper.Close(); 

       outDoc = PdfReader.Open(outputStream, openmode); 
      } 

      return outDoc; 
     } 

     /// <summary> 
     /// Uses a recurrsive function to step through the PDF document tree to find the specified objects. 
     /// </summary> 
     /// <param name="objectHierarchy">An array of the names of objects to look for in the tree. Wildcards can be used in element names, e.g., /F*. The order represents 
     /// a top-down hierarchy if followHierarchy is true. 
     /// If a single object is passed in array it should be in the level below startingObject, or followHierarchy set to false to find it anywhere in the tree</param> 
     /// <param name="startingObject">A PDF object to parse. This will likely be a document or a page, but could be any lower-level item</param> 
     /// <param name="followHierarchy">If true the order of names in the objectHierarchy will be used to search only that branch. If false the whole tree will be parsed for 
     /// any items matching those in objectHierarchy regardless of position</param> 
     static public List<PdfItem> FindObjects(string[] objectHierarchy, PdfItem startingObject, bool followHierarchy) 
     { 
      List<PdfItem> results = new List<PdfItem>(); 
      FindObjects(objectHierarchy, startingObject, followHierarchy, ref results, 0); 
      return results; 
     } 

     static private void FindObjects(string[] objectHierarchy, PdfItem startingObject, bool followHierarchy, ref List<PdfItem> results, int Level) 
     { 
      PdfName[] keyNames = ((PdfDictionary)startingObject).Elements.KeyNames; 
      foreach (PdfName keyName in keyNames) 
      { 
       bool matchFound = false; 
       if (!followHierarchy) 
       { 
        // We need to check all items for a match, not just the top one 
        for (int i = 0; i < objectHierarchy.Length; i++) 
        { 
         if (keyName.Value == objectHierarchy[i] || 
          (objectHierarchy[i].Contains("*") && 
           (keyName.Value.StartsWith(objectHierarchy[i].Substring(0, objectHierarchy[i].IndexOf("*") - 1)) && 
           keyName.Value.EndsWith(objectHierarchy[i].Substring(objectHierarchy[i].IndexOf("*") + 1))))) 
         { 
          matchFound = true; 
         } 
        } 
       } 
       else 
       { 
        // Check the item in the hierarchy at this level for a match 
        if (Level < objectHierarchy.Length && (keyName.Value == objectHierarchy[Level] || 
         (objectHierarchy[Level].Contains("*") && 
           (keyName.Value.StartsWith(objectHierarchy[Level].Substring(0, objectHierarchy[Level].IndexOf("*") - 1)) && 
           keyName.Value.EndsWith(objectHierarchy[Level].Substring(objectHierarchy[Level].IndexOf("*") + 1)))))) 
        { 
         matchFound = true; 
        } 
       } 

       if (matchFound) 
       { 
        PdfItem item = ((PdfDictionary)startingObject).Elements[keyName]; 
        if (item != null && item is PdfSharp.Pdf.Advanced.PdfReference) 
        { 
         item = ((PdfSharp.Pdf.Advanced.PdfReference)item).Value; 
        } 

        System.Diagnostics.Debug.WriteLine("Level " + Level.ToString() + " - " + keyName.ToString() + " matched"); 

        if (Level == objectHierarchy.Length - 1) 
        { 
         // We are at the end of the hierarchy, so this is the target 
         results.Add(item); 
        } 
        else if (!followHierarchy) 
        { 
         // We are returning every matching object so add it 
         results.Add(item); 
        } 

        // Call back to this function to search lower levels 
        Level++; 
        FindObjects(objectHierarchy, item, followHierarchy, ref results, Level); 
        Level--; 
       } 
       else 
       { 
        System.Diagnostics.Debug.WriteLine("Level " + Level.ToString() + " - " + keyName.ToString() + " unmatched"); 
       } 
      } 
      Level--; 
      System.Diagnostics.Debug.WriteLine("Level " + Level.ToString()); 
     } 

     /// <summary> 
     /// Uses the Font object to translate CID encoded text to readable text 
     /// </summary> 
     /// <param name="unreadableText">The text stream that needs to be decoded</param> 
     /// <param name="font">A List of PDFItems containing the /Font object containing a /ToUnicode with a CMap</param> 
     static public string FromUnicode(PdfDictionary.PdfStream unreadableText, List<PdfItem> PDFFonts) 
     { 
      Dictionary<string, string[]> fonts = new Dictionary<string, string[]>(); 

      // Get the CMap from each font in the passed array and store them by font name 
      for (int font = 0; font < PDFFonts.Count; font++) 
      { 
       PdfName[] keyNames = ((PdfDictionary)PDFFonts[font]).Elements.KeyNames; 
       foreach (PdfName keyName in keyNames) 
       { 
        if (keyName.Value == "/ToUnicode") { 
         PdfItem item = ((PdfDictionary)PDFFonts[font]).Elements[keyName]; 
         if (item != null && item is PdfSharp.Pdf.Advanced.PdfReference) 
         { 
          item = ((PdfSharp.Pdf.Advanced.PdfReference)item).Value; 
         } 
         string FontName = "/F" + font.ToString(); 
         string CMap = ((PdfDictionary)item).Stream.ToString(); 

         if (CMap.IndexOf("beginbfrange") > 0) 
         { 
          CMap = CMap.Substring(CMap.IndexOf("beginbfrange") + "beginbfrange".Length); 

          if (CMap.IndexOf("endbfrange") > 0) 
          { 
           CMap = CMap.Substring(0, CMap.IndexOf("endbfrange") - 1); 

           string[] CMapArray = CMap.Split(new string[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries); 
           fonts.Add(FontName, CMapArray); 
          } 
         } 
         break; 
        } 
       } 
      } 

      // Holds the final result to be returned 
      string resultString = ""; 

      // Break the input text into lines 
      string[] lines = unreadableText.ToString().Split(new string[] {"\n"} , StringSplitOptions.RemoveEmptyEntries); 

      // Holds the last font reference and therefore the CMAP table 
      // to be used for any text found after it 
      string[] currentFontRef = fonts["/F0"]; 

      // Are we in a block of text or not? They can break across lines so we need an identifier 
      bool blnInText = false; 

      for (int line = 0; line < lines.Length; line++) 
      { 
       string thisLine = lines[line].Trim(); 

       if (thisLine == "q") 
       { 
        // I think this denotes the start of a text block, and where we need to reset to the default font 
        currentFontRef = fonts["/F0"]; 
       } 
       else if (thisLine.IndexOf(" Td <") != -1) 
       { 
        thisLine = thisLine.Substring(thisLine.IndexOf(" Td <") + 5); 
        blnInText = true; 
       } 

       if (thisLine.EndsWith("Tf")) 
       { 
        // This is a font assignment. Take note of this and use this fonts ToUnicode map when we find text 
        if (fonts.ContainsKey(thisLine.Substring(0, thisLine.IndexOf(" ")))) 
        { 
         currentFontRef = fonts[thisLine.Substring(0, thisLine.IndexOf(" "))]; 
        } 
       } 
       else if (thisLine.EndsWith("> Tj")) 
       { 
        thisLine = thisLine.Substring(0, thisLine.IndexOf("> Tj")); 
       } 

       if(blnInText) 
       { 
        // This is a text block 
        try 
        { 
         // Get the section of codes that exist between angled brackets 
         string unicodeStr = thisLine; 
         // Wrap every group of 4 characters in angle brackets 
         // This will directly match the items in the CMap but also allows the next for to avoid double-translating items 
         unicodeStr = "<" + String.Join("><", unicodeStr.SplitInParts(4)) + ">"; 

         for (int transform = 0; transform < currentFontRef.Length; transform++) 
         { 
          // Get the last item in the line, which is the unicode value of the glyph 
          string glyph = currentFontRef[transform].Substring(currentFontRef[transform].IndexOf("<")); 
          glyph = glyph.Substring(0, glyph.IndexOf(">") + 1); 

          string counterpart = currentFontRef[transform].Substring(currentFontRef[transform].LastIndexOf("<") + 1); 
          counterpart = counterpart.Substring(0, counterpart.LastIndexOf(">")); 

          // Replace each item that matches with the translated counterpart 
          // Insert a \\u before every 4th character so it's a C# unicode compatible string 
          unicodeStr = unicodeStr.Replace(glyph, "\\u" + counterpart); 
          if (unicodeStr.IndexOf(">") == 0) 
          { 
           // All items have been replaced, so lets get outta here 
           break; 
          } 
         } 
         resultString = resultString + System.Text.RegularExpressions.Regex.Unescape(unicodeStr); 
        } 
        catch 
        { 
         return ""; 
        } 
       } 

       if (lines[line].Trim().EndsWith("> Tj")) 
       { 
        blnInText = false; 
        if (lines[line].Trim().IndexOf(" 0 Td <") == -1) 
        { 
         // The vertical coords have changed, so add a new line 
         resultString = resultString + Environment.NewLine; 
        } 
        else 
        { 
         resultString = resultString + " "; 
        } 
       } 
      } 
      return resultString; 
     } 

     // Credit to http://stackoverflow.com/questions/4133377/ 
     private static IEnumerable<String> SplitInParts(this String s, Int32 partLength) 
     { 
      if (s == null) 
       throw new ArgumentNullException("s"); 
      if (partLength <= 0) 
       throw new ArgumentException("Part length has to be positive.", "partLength"); 

      for (var i = 0; i < s.Length; i += partLength) 
       yield return s.Substring(i, Math.Min(partLength, s.Length - i)); 
     } 
    } 
} 

public class PDFParser 
{ 
    /// BT = Beginning of a text object operator 
    /// ET = End of a text object operator 
    /// Td move to the start of next line 
    /// 5 Ts = superscript 
    /// -5 Ts = subscript 

    #region Fields 

    #region _numberOfCharsToKeep 
    /// <summary> 
    /// The number of characters to keep, when extracting text. 
    /// </summary> 
    private static int _numberOfCharsToKeep = 15; 
    #endregion 

    #endregion 



    #region ExtractTextFromPDFBytes 
    /// <summary> 
    /// This method processes an uncompressed Adobe (text) object 
    /// and extracts text. 
    /// </summary> 
    /// <param name="input">uncompressed</param> 
    /// <returns></returns> 
    public string ExtractTextFromPDFBytes(byte[] input) 
    { 
     if (input == null || input.Length == 0) return ""; 

     try 
     { 
      string resultString = ""; 

      // Flag showing if we are we currently inside a text object 
      bool inTextObject = false; 

      // Flag showing if the next character is literal 
      // e.g. '\\' to get a '\' character or '\(' to get '(' 
      bool nextLiteral = false; 

      //() Bracket nesting level. Text appears inside() 
      int bracketDepth = 0; 

      // Keep previous chars to get extract numbers etc.: 
      char[] previousCharacters = new char[_numberOfCharsToKeep]; 
      for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' '; 


      for (int i = 0; i < input.Length; i++) 
      { 
       char c = (char)input[i]; 

       if (inTextObject) 
       { 
        // Position the text 
        if (bracketDepth == 0) 
        { 
         if (CheckToken(new string[] { "TD", "Td" }, previousCharacters)) 
         { 
          resultString += "\n\r"; 
         } 
         else 
         { 
          if (CheckToken(new string[] { "'", "T*", "\"" }, previousCharacters)) 
          { 
           resultString += "\n"; 
          } 
          else 
          { 
           if (CheckToken(new string[] { "Tj" }, previousCharacters)) 
           { 
            resultString += " "; 
           } 
          } 
         } 
        } 

        // End of a text object, also go to a new line. 
        if (bracketDepth == 0 && 
         CheckToken(new string[] { "ET" }, previousCharacters)) 
        { 

         inTextObject = false; 
         resultString += " "; 
        } 
        else 
        { 
         // Start outputting text 
         if ((c == '(') && (bracketDepth == 0) && (!nextLiteral)) 
         { 
          bracketDepth = 1; 
         } 
         else 
         { 
          // Stop outputting text 
          if ((c == ')') && (bracketDepth == 1) && (!nextLiteral)) 
          { 
           bracketDepth = 0; 
          } 
          else 
          { 
           // Just a normal text character: 
           if (bracketDepth == 1) 
           { 
            // Only print out next character no matter what. 
            // Do not interpret. 
            if (c == '\\' && !nextLiteral) 
            { 
             nextLiteral = true; 
            } 
            else 
            { 
             if (((c >= ' ') && (c <= '~')) || 
              ((c >= 128) && (c < 255))) 
             { 
              resultString += c.ToString(); 
             } 

             nextLiteral = false; 
            } 
           } 
          } 
         } 
        } 
       } 

       // Store the recent characters for 
       // when we have to go back for a checking 
       for (int j = 0; j < _numberOfCharsToKeep - 1; j++) 
       { 
        previousCharacters[j] = previousCharacters[j + 1]; 
       } 
       previousCharacters[_numberOfCharsToKeep - 1] = c; 

       // Start of a text object 
       if (!inTextObject && CheckToken(new string[] { "BT" }, previousCharacters)) 
       { 
        inTextObject = true; 
       } 
      } 
      return resultString; 
     } 
     catch 
     { 
      return ""; 
     } 
    } 
    #endregion 

    #region CheckToken 
    /// <summary> 
    /// Check if a certain 2 character token just came along (e.g. BT) 
    /// </summary> 
    /// <param name="search">the searched token</param> 
    /// <param name="recent">the recent character array</param> 
    /// <returns></returns> 
    private bool CheckToken(string[] tokens, char[] recent) 
    { 
     foreach (string token in tokens) 
     { 
      if (token.Length > 1) 
      { 
       if ((recent[_numberOfCharsToKeep - 3] == token[0]) && 
        (recent[_numberOfCharsToKeep - 2] == token[1]) && 
        ((recent[_numberOfCharsToKeep - 1] == ' ') || 
        (recent[_numberOfCharsToKeep - 1] == 0x0d) || 
        (recent[_numberOfCharsToKeep - 1] == 0x0a)) && 
        ((recent[_numberOfCharsToKeep - 4] == ' ') || 
        (recent[_numberOfCharsToKeep - 4] == 0x0d) || 
        (recent[_numberOfCharsToKeep - 4] == 0x0a)) 
        ) 
       { 
        return true; 
       } 
      } 
      else 
      { 
       return false; 
      } 

     } 
     return false; 
    } 
    #endregion 
}

感謝所有那些誰提供幫助和片段，讓我終於拉了有效的解決方案一起

來源

2015-11-04 15:44:03 Red

爲什麼你只考慮字體名有** f **開始？ '串[]層次=新的字符串[] {「/資源」，「/字體」，「/ F *」};' – mkl

下/字體對象是與增量從零開始的名稱，例如字體的集合，/ F0，/ F1，/ F2。所以/ F *表示任何以「/ F」開頭。它不涉及字體名稱。據我所知，字體不被存儲爲其他任何東西。但是，如果他們是你可以調整搜索參數 – Red

*在/字體對象是一個字體的集合增量爲零的名稱，例如/ F0，/ F1，/ F2 * - 這些名稱** F0 **，* * F1 **，...不是必需的，他們也可以被命名爲** Helv **，** Arial **，** Bobo **，無論如何。 – mkl

從帶有CID字體的PDF中提取文本

回答

相關問題