Extracting text from an XPS document

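I need to extract the text of a specific page from an XPS document. The extracted text should be written to a string, because I want to have it read aloud using Microsoft's SpeechLib. Please use examples in C# only.

Thanks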

2012-09-04 Tim Trabold

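Since you have tagged the question as C#, nearly all answers will be in C# anyway, but why only C#? Are you allergic to other languages?

No, but my company develops in C#, so I have to as well.

So what? Write it in any other language, then use an online converter (such as http://www.developerfusion.com/tools/convert/csharp-to-vb/#convert-again) to change it to the language you need. At my last company I coded in C#, and now I write code in VB. The syntax was a problem only for the first two days.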

Add references to ReachFramework and WindowsBase, and the following using declaration:

using System.Windows.Xps.Packaging;

Then use this code:

// StringBuilder also needs: using System.Text;
XpsDocument _xpsDocument = new XpsDocument("/path", System.IO.FileAccess.Read);
IXpsFixedDocumentSequenceReader fixedDocSeqReader
    = _xpsDocument.FixedDocumentSequenceReader;
IXpsFixedDocumentReader _document = fixedDocSeqReader.FixedDocuments[0];
// documentViewerElement is the DocumentViewer that displays the document.
IXpsFixedPageReader _page
    = _document.FixedPages[documentViewerElement.MasterPageNumber];
StringBuilder _currentText = new StringBuilder();
System.Xml.XmlReader _pageContentReader = _page.XmlReader;
if (_pageContentReader != null)
{
    while (_pageContentReader.Read())
    {
        // The page text is stored in the UnicodeString attribute of Glyphs elements.
        if (_pageContentReader.Name == "Glyphs" && _pageContentReader.HasAttributes)
        {
            string text = _pageContentReader.GetAttribute("UnicodeString");
            if (text != null)
            {
                _currentText.Append(text);
            }
        }
    }
}
_xpsDocument.Close();
string _fullPageText = _currentText.ToString();

The text is stored in the UnicodeString attribute of the Glyphs elements. You have to read each fixed page through its XmlReader.
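
As a minimal sketch of that idea, the following walks page markup with a plain XmlReader and collects every UnicodeString attribute. It is written against a reader over any markup rather than a real XPS package, so it can be tried without ReachFramework. The names GlyphTextSketch, ExtractText and ExtractTextFromMarkup are illustrative assumptions and not part of the answer, and joining the runs with a space is a choice made here, since XPS stores text as positioned glyph runs.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class GlyphTextSketch
{
    // Collects the UnicodeString attribute of every Glyphs element, in document order.
    public static string ExtractText(XmlReader pageReader)
    {
        var text = new StringBuilder();
        while (pageReader.Read())
        {
            // Skip end tags, whitespace and every element that is not Glyphs.
            if (pageReader.NodeType != XmlNodeType.Element) continue;
            if (pageReader.LocalName != "Glyphs") continue;

            string run = pageReader.GetAttribute("UnicodeString");
            if (string.IsNullOrEmpty(run)) continue;

            // Separate runs with a space (an assumption; the real layout is positional).
            if (text.Length > 0) text.Append(' ');
            text.Append(run);
        }
        return text.ToString();
    }

    // Convenience overload for trying the method on a markup string.
    public static string ExtractTextFromMarkup(string markup)
    {
        using (var reader = XmlReader.Create(new StringReader(markup)))
        {
            return ExtractText(reader);
        }
    }
}
```

With a real document, the reader passed to ExtractText would be the XmlReader property of an IXpsFixedPageReader, as in the code above.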


2012-09-05 12:53:42 Sanjay

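@Tim Trabold: Feedback on the answer would be helpful. – Sanjay

I get the following exception: The type 'System.IO.Packaging.Package' is defined in an assembly that is not referenced. You must add a reference to assembly 'WindowsBase, Version=3.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'. – 2013-09-26 05:33:17

+1, that cleared it up. Great work. – 2013-09-26 06:30:00

Full code of the class: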

using System.Collections.Generic;
using System.Drawing;
using System.Windows.Forms;
using System.Windows.Xps.Packaging;

namespace XPS_Data_Transfer
{
    internal static class XpsDataReader
    {
        public static List<string> ReadXps(string address, int pageNumber)
        {
            var xpsDocument = new XpsDocument(address, System.IO.FileAccess.Read);
            try
            {
                var fixedDocSeqReader = xpsDocument.FixedDocumentSequenceReader;
                if (fixedDocSeqReader == null) return null;

                const string uniStr = "UnicodeString";
                const string glyphs = "Glyphs";
                // Note: pageNumber (1-based) selects a fixed document, and that document's
                // first page is read. This fits files that store each page as its own
                // fixed document.
                var document = fixedDocSeqReader.FixedDocuments[pageNumber - 1];
                var page = document.FixedPages[0];
                var currentText = new List<string>();
                var pageContentReader = page.XmlReader;

                if (pageContentReader == null) return null;
                while (pageContentReader.Read())
                {
                    if (pageContentReader.Name != glyphs) continue;
                    if (!pageContentReader.HasAttributes) continue;
                    var text = pageContentReader.GetAttribute(uniStr);
                    // Dashboard.CleanReversedPersianText is the author's own helper (not shown here).
                    if (text != null)
                        currentText.Add(Dashboard.CleanReversedPersianText(text));
                }
                return currentText;
            }
            finally
            {
                // Release the file even when returning early.
                xpsDocument.Close();
            }
        }
    }
}

This returns a list of the text strings found on a given page of a given file.


2014-08-09 16:35:34 Amir

Dashboard.CleanReversedPersianText is missing. – salle55

// Needs: using System.Text; using System.Windows.Documents; using System.Windows.Xps.Packaging;
private string ReadXpsFile(string fileName)
{
    XpsDocument _xpsDocument = new XpsDocument(fileName, System.IO.FileAccess.Read);
    IXpsFixedDocumentSequenceReader fixedDocSeqReader
        = _xpsDocument.FixedDocumentSequenceReader;
    // Only the first fixed document is read; the loop below assumes it holds every page.
    IXpsFixedDocumentReader _document = fixedDocSeqReader.FixedDocuments[0];
    FixedDocumentSequence sequence = _xpsDocument.GetFixedDocumentSequence();
    StringBuilder _fullPageText = new StringBuilder();
    for (int pageCount = 0; pageCount < sequence.DocumentPaginator.PageCount; ++pageCount)
    {
        IXpsFixedPageReader _page = _document.FixedPages[pageCount];
        System.Xml.XmlReader _pageContentReader = _page.XmlReader;
        if (_pageContentReader == null) continue;
        while (_pageContentReader.Read())
        {
            if (_pageContentReader.Name != "Glyphs" || !_pageContentReader.HasAttributes) continue;
            string text = _pageContentReader.GetAttribute("UnicodeString");
            if (text != null) _fullPageText.Append(text);
        }
    }
    _xpsDocument.Close();
    return _fullPageText.ToString();
}


2014-08-11 05:03:28 Nurkhan

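I get an ArgumentOutOfRangeException with this code: _document.FixedPages contains only a single element, even though the XPS file contains multiple pages. See: http://i.imgur.com/gpcKxCX.png – salle55

A method that returns the text of all pages (modified from Amir's code, I hope that's OK):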

// Needs: using System.Collections.Generic; using System.IO; using System.Windows.Xps.Packaging;

/// <summary>
/// Get all text strings from an XPS file.
/// Returns a list of lists (one for each page) containing the text strings.
/// </summary>
private static List<List<string>> ExtractTextFromXps(string xpsFilePath)
{
    var xpsDocument = new XpsDocument(xpsFilePath, FileAccess.Read);
    var fixedDocSeqReader = xpsDocument.FixedDocumentSequenceReader;
    if (fixedDocSeqReader == null)
    {
        xpsDocument.Close();
        return null;
    }

    const string UnicodeString = "UnicodeString";
    const string GlyphsString = "Glyphs";

    var textLists = new List<List<string>>();
    // Walk every fixed document, since pages may be spread across several of them.
    foreach (IXpsFixedDocumentReader fixedDocumentReader in fixedDocSeqReader.FixedDocuments)
    {
        foreach (IXpsFixedPageReader pageReader in fixedDocumentReader.FixedPages)
        {
            var pageContentReader = pageReader.XmlReader;
            if (pageContentReader == null)
                continue;

            var texts = new List<string>();
            while (pageContentReader.Read())
            {
                if (pageContentReader.Name != GlyphsString)
                    continue;
                if (!pageContentReader.HasAttributes)
                    continue;
                var text = pageContentReader.GetAttribute(UnicodeString);
                if (text != null)
                    texts.Add(text);
            }
            textLists.Add(texts);
        }
    }
    xpsDocument.Close();
    return textLists;
}

Usage:

var txtLists = ExtractTextFromXps(@"C:\myfile.xps");

int pageIdx = 0;
foreach (List<string> txtList in txtLists)
{
    pageIdx++;
    Console.WriteLine("== Page {0} ==", pageIdx);
    foreach (string txt in txtList)
        Console.WriteLine(" " + txt);
    Console.WriteLine();
}
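
Since the question asks for the text of one specific page as a single string, a small helper along these lines could sit on top of the list-of-lists result. PageTextSketch and JoinPage are hypothetical names, and joining the runs with a space is an assumption: the strings alone do not carry the original spacing or line breaks.

```csharp
using System;
using System.Collections.Generic;

public static class PageTextSketch
{
    // Joins the text runs of one page (1-based page number) into a single string.
    public static string JoinPage(IReadOnlyList<List<string>> pages, int pageNumber)
    {
        if (pages == null) throw new ArgumentNullException(nameof(pages));
        if (pageNumber < 1 || pageNumber > pages.Count)
            throw new ArgumentOutOfRangeException(nameof(pageNumber));

        return string.Join(" ", pages[pageNumber - 1]);
    }
}
```

For example, PageTextSketch.JoinPage(txtLists, 2) would give the text of the second page, which could then be passed on to the speech library.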


2017-01-30 16:07:05 salle55
