2012-09-04 288 views
1

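Extracting text from an XPS document

I need to extract the text of a specific page from an XPS document. The extracted text should be written to a string, because I need to read it aloud using Microsoft's SpeechLib. Please give examples in C# only.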

Thanks

+0

Since you have tagged the question as C#, almost all answers will be in C# anyway, but why only C#? Are you allergic to other languages? –

+0

No, but my company develops in C#, so I have to as well. –

+0

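So what? Write it in any other language, then use an online converter (such as http://www.developerfusion.com/tools/convert/csharp-to-vb/#convert-again) to convert it to the language you need. At my last company I coded in C#, and now I write VB. The syntax was only a problem for the first two days. –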

Answers

9

Add references to ReachFramework and WindowsBase, and the following using directives:

using System.Text;
using System.Windows.Xps.Packaging;

Then use this code:

// Open the XPS file read-only. Replace "/path" with the path to your document.
XpsDocument _xpsDocument = new XpsDocument("/path", System.IO.FileAccess.Read);
IXpsFixedDocumentSequenceReader fixedDocSeqReader
    = _xpsDocument.FixedDocumentSequenceReader;
IXpsFixedDocumentReader _document = fixedDocSeqReader.FixedDocuments[0];
// documentViewerElement is presumably the DocumentViewer showing the file.
// FixedPages is zero-based, so adjust the index if your page number is one-based.
IXpsFixedPageReader _page
    = _document.FixedPages[documentViewerElement.MasterPageNumber];
StringBuilder _currentText = new StringBuilder();
System.Xml.XmlReader _pageContentReader = _page.XmlReader;
if (_pageContentReader != null)
{
    while (_pageContentReader.Read())
    {
        // Text is stored in the UnicodeString attribute of each Glyphs element.
        if (_pageContentReader.Name == "Glyphs" && _pageContentReader.HasAttributes)
        {
            string unicodeString = _pageContentReader.GetAttribute("UnicodeString");
            if (unicodeString != null)
            {
                _currentText.Append(unicodeString);
            }
        }
    }
}
// Release the file once the page has been read.
_xpsDocument.Close();
string _fullPageText = _currentText.ToString();

The text is stored in the UnicodeString attribute of the Glyphs elements. You have to use the XmlReader of the fixed page to read it.
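To see the idea in isolation, here is a minimal sketch that runs the same kind of XmlReader loop over a hand-written FixedPage fragment instead of a real XPS file. The sample markup and the names GlyphTextDemo and ExtractGlyphText are made up for illustration; real pages carry many more attributes, and this is not meant as a complete parser.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class GlyphTextDemo
{
    // Hypothetical helper: collects the UnicodeString value of every
    // Glyphs element found in a piece of FixedPage markup.
    public static string ExtractGlyphText(string fixedPageXml)
    {
        var text = new StringBuilder();
        using (var reader = XmlReader.Create(new StringReader(fixedPageXml)))
        {
            while (reader.Read())
            {
                // Skip everything that is not a <Glyphs> start element.
                if (reader.NodeType != XmlNodeType.Element || reader.LocalName != "Glyphs")
                    continue;

                string value = reader.GetAttribute("UnicodeString");
                if (!string.IsNullOrEmpty(value))
                    text.AppendLine(value);
            }
        }
        return text.ToString();
    }

    static void Main()
    {
        // Hand-written sample, reduced to the attribute that matters here.
        const string samplePage =
            "<FixedPage xmlns=\"http://schemas.microsoft.com/xps/2005/06\">" +
            "<Glyphs UnicodeString=\"Hello\" />" +
            "<Path />" +
            "<Glyphs UnicodeString=\"World\" />" +
            "</FixedPage>";

        Console.Write(ExtractGlyphText(samplePage));
    }
}
```

Each Glyphs element usually holds one run of text, so appending a separator between runs (a line break here) tends to give more readable output than concatenating them directly.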

+2

@Tim Trabold: Feedback on the answer would be helpful. – Sanjay

+0

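I get the following exception: Error: The type 'System.IO.Packaging.Package' is defined in an assembly that is not referenced. You must add a reference to assembly 'WindowsBase, Version=3.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35'. – 2013-09-26 05:33:17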

+0

That cleared it up. Great work. – 2013-09-26 06:30:00

0

Full code of the class:

using System.Collections.Generic;
using System.Drawing;
using System.Windows.Forms;
using System.Windows.Xps.Packaging;

namespace XPS_Data_Transfer
{
    internal static class XpsDataReader
    {
        // Returns the text strings of one page, or null if the file cannot be read.
        public static List<string> ReadXps(string address, int pageNumber)
        {
            var xpsDocument = new XpsDocument(address, System.IO.FileAccess.Read);
            var fixedDocSeqReader = xpsDocument.FixedDocumentSequenceReader;
            if (fixedDocSeqReader == null) return null;

            const string uniStr = "UnicodeString";
            const string glyphs = "Glyphs";
            // Note: pageNumber (one-based) selects a fixed document here, and the
            // first page of that document is read.
            var document = fixedDocSeqReader.FixedDocuments[pageNumber - 1];
            var page = document.FixedPages[0];
            var currentText = new List<string>();
            var pageContentReader = page.XmlReader;

            if (pageContentReader == null) return null;
            while (pageContentReader.Read())
            {
                if (pageContentReader.Name != glyphs) continue;
                if (!pageContentReader.HasAttributes) continue;
                // Dashboard.CleanReversedPersianText is a helper from the
                // answerer's own project and is not shown here.
                if (pageContentReader.GetAttribute(uniStr) != null)
                    currentText.Add(Dashboard.CleanReversedPersianText(pageContentReader.GetAttribute(uniStr)));
            }
            return currentText;
        }
    }
}

It returns a list of the text strings from the given page of the given file.

+0

Dashboard.CleanReversedPersianText is missing. – salle55

0

private string ReadXpsFile(string fileName)
{
    XpsDocument _xpsDocument = new XpsDocument(fileName, System.IO.FileAccess.Read);
    IXpsFixedDocumentSequenceReader fixedDocSeqReader
        = _xpsDocument.FixedDocumentSequenceReader;
    // Only the first fixed document is read; this assumes it holds every page.
    IXpsFixedDocumentReader _document = fixedDocSeqReader.FixedDocuments[0];
    FixedDocumentSequence sequence = _xpsDocument.GetFixedDocumentSequence();
    string _fullPageText = "";
    for (int pageCount = 0; pageCount < sequence.DocumentPaginator.PageCount; ++pageCount)
    {
        IXpsFixedPageReader _page = _document.FixedPages[pageCount];
        StringBuilder _currentText = new StringBuilder();
        System.Xml.XmlReader _pageContentReader = _page.XmlReader;
        if (_pageContentReader != null)
        {
            while (_pageContentReader.Read())
            {
                // Text is stored in the UnicodeString attribute of each Glyphs element.
                if (_pageContentReader.Name == "Glyphs" && _pageContentReader.HasAttributes)
                {
                    if (_pageContentReader.GetAttribute("UnicodeString") != null)
                    {
                        _currentText.Append(_pageContentReader.GetAttribute("UnicodeString"));
                    }
                }
            }
        }
        _fullPageText += _currentText.ToString();
    }
    return _fullPageText;
}

+0

I get an ArgumentOutOfRangeException with this code: _document.FixedPages contains only a single element, even though the XPS file contains multiple pages. See: http://i.imgur.com/gpcKxCX.png – salle55

0

A method that returns the text of all pages (I modified Amir's code, hope that's OK):

/// <summary> 
/// Get all text strings from an XPS file. 
/// Returns a list of lists (one for each page) containing the text strings. 
/// </summary> 
private static List<List<string>> ExtractTextFromXps(string xpsFilePath)
{
    var xpsDocument = new XpsDocument(xpsFilePath, FileAccess.Read);
    var fixedDocSeqReader = xpsDocument.FixedDocumentSequenceReader;
    if (fixedDocSeqReader == null)
        return null;

    const string UnicodeString = "UnicodeString";
    const string GlyphsString = "Glyphs";

    var textLists = new List<List<string>>();
    foreach (IXpsFixedDocumentReader fixedDocumentReader in fixedDocSeqReader.FixedDocuments)
    {
        foreach (IXpsFixedPageReader pageReader in fixedDocumentReader.FixedPages)
        {
            var pageContentReader = pageReader.XmlReader;
            if (pageContentReader == null)
                continue;

            var texts = new List<string>();
            while (pageContentReader.Read())
            {
                if (pageContentReader.Name != GlyphsString)
                    continue;
                if (!pageContentReader.HasAttributes)
                    continue;
                if (pageContentReader.GetAttribute(UnicodeString) != null)
                    texts.Add(pageContentReader.GetAttribute(UnicodeString));
            }
            textLists.Add(texts);
        }
    }
    xpsDocument.Close();
    return textLists;
}

Usage:

var txtLists = ExtractTextFromXps(@"C:\myfile.xps"); 

int pageIdx = 0; 
foreach (List<string> txtList in txtLists) 
{ 
    pageIdx++; 
    Console.WriteLine("== Page {0} ==", pageIdx); 
    foreach (string txt in txtList) 
        Console.WriteLine(" " + txt);
    Console.WriteLine(); 
}
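
Since the question asks for a single page as one string, the same traversal can be narrowed to one page. The following is only a sketch under a few assumptions: the page index is zero-based and counted across all fixed documents in the sequence, text runs are joined with a space, and XpsPageText and ReadPageText are hypothetical names. It needs the same ReachFramework and WindowsBase references as the code above.

```csharp
using System;
using System.IO;
using System.Text;
using System.Windows.Xps.Packaging;

static class XpsPageText
{
    // Hypothetical helper: returns the text of one page, counting pages
    // across every fixed document in the sequence (zero-based index).
    public static string ReadPageText(string xpsFilePath, int pageIndex)
    {
        if (pageIndex < 0)
            throw new ArgumentOutOfRangeException(nameof(pageIndex));

        var xpsDocument = new XpsDocument(xpsFilePath, FileAccess.Read);
        try
        {
            var sequenceReader = xpsDocument.FixedDocumentSequenceReader;
            if (sequenceReader == null)
                return null;

            int current = 0;
            foreach (IXpsFixedDocumentReader documentReader in sequenceReader.FixedDocuments)
            {
                foreach (IXpsFixedPageReader pageReader in documentReader.FixedPages)
                {
                    // Keep counting until the requested page is reached.
                    if (current++ != pageIndex)
                        continue;

                    var text = new StringBuilder();
                    var reader = pageReader.XmlReader;
                    while (reader != null && reader.Read())
                    {
                        if (reader.Name != "Glyphs")
                            continue;
                        string value = reader.GetAttribute("UnicodeString");
                        if (!string.IsNullOrEmpty(value))
                            text.Append(value).Append(' ');
                    }
                    return text.ToString();
                }
            }
            // The file has fewer pages than requested.
            throw new ArgumentOutOfRangeException(nameof(pageIndex));
        }
        finally
        {
            // Always release the file, whichever way the method exits.
            xpsDocument.Close();
        }
    }
}
```

Counting across documents avoids depending on whether a given file keeps all of its pages in one fixed document or spreads them over several. The returned string can then be handed to the speech API mentioned in the question.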