2017-10-06

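Export the first header and footer of a docx/doc as a docx file using OpenXML

I would like to ask how to convert the header/footer of an MS Word document (doc/docx) to HTML. I open the document like this: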

using (WordprocessingDocument wDoc = WordprocessingDocument.Open(memoryStream, true)) 

a.k.a. OpenXML.

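I convert the document with WmlToHtmlConverter, which converts the document body very well. Only the header and footer are skipped, because the HTML standard does not support pagination. I would like to know how to get them and extract them as HTML. I tried something like this: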

using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(mainFileMemoryStream, true)) 
{ 
    Document mainPart = wdDoc.MainDocumentPart.Document; 
    DocumentFormat.OpenXml.Packaging.HeaderPart firstHeader = 
      wdDoc.MainDocumentPart.HeaderParts.FirstOrDefault(); 

    if (firstHeader != null) 
    { 
     using (var headerStream = firstHeader.GetStream()) 
     { 
      return headerStream.ReadFully(); 
     } 
    } 
    return null; 
} 

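and then passed the result to the conversion function, but I get an exception that says:

"File contains corrupted data", with this stack trace: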

at System.IO.Packaging.ZipPackage..ctor(Stream s, FileMode packageFileMode, FileAccess packageFileAccess) 
at System.IO.Packaging.Package.Open(Stream stream, FileMode packageMode, FileAccess packageAccess) 
at DocumentFormat.OpenXml.Packaging.OpenXmlPackage.OpenCore(Stream stream, Boolean readWriteMode) 
at DocumentFormat.OpenXml.Packaging.WordprocessingDocument.Open(Stream stream, Boolean isEditable, OpenSettings openSettings) 
at DocumentFormat.OpenXml.Packaging.WordprocessingDocument.Open(Stream stream, Boolean isEditable) 
at DocxToHTML.Converter.HTMLConverter.ParseDOCX(Byte[] fileInfo, String fileName) in D:\eTemida\eTemida.Web\DocxToHTML.Converter\HTMLConverter.cs:line 96 

Any help would be appreciated.

+0

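Hi, there is no direct way in OpenXML (i.e. in Open XML PowerTools) to get the header and footer as HTML. Instead, you have to read the header and footer content as text and then apply the header styles to that text yourself. See: https://github.com/OfficeDev/Open-Xml-PowerTools/issues/66#issuecomment-326629828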

Answer

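After a lot of struggle I arrived at the following solution.

I created a function that converts the byte array of a DOCX document to HTML, as follows: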

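public string ConvertToHtml(byte[] fileInfo, string fileName = "Default.docx") 
    { 
     // Only .docx is supported; compare the extension case-insensitively. 
     if (string.IsNullOrEmpty(fileName) || 
      !string.Equals(Path.GetExtension(fileName), ".docx", StringComparison.OrdinalIgnoreCase)) 
      return "Unsupported format"; 

     string htmlText = string.Empty; 
     try 
     { 
      htmlText = ParseDOCX(fileInfo, fileName); 
     } 
     catch (OpenXmlPackageException e) 
     { 
      // Documents with malformed hyperlink URIs fail to open; repair them and retry. 
      if (e.ToString().Contains("Invalid Hyperlink")) 
      { 
       // Use an expandable stream so the repaired package can grow if it needs to, 
       // and retry with the repaired bytes rather than the original array. 
       using (MemoryStream fs = new MemoryStream()) 
       { 
        fs.Write(fileInfo, 0, fileInfo.Length); 
        UriFixer.FixInvalidUri(fs, brokenUri => FixUri(brokenUri)); 
        htmlText = ParseDOCX(fs.ToArray(), fileName); 
       } 
      } 
     } 
     return htmlText; 
    } 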
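
The FixUri callback passed to UriFixer.FixInvalidUri is not shown in this answer. A minimal sketch of what it could look like, assuming it is enough to replace every malformed hyperlink target with a single placeholder (the class name and the placeholder URL are assumptions, not part of the original code):

```csharp
using System;

public static class UriHelpers
{
    // Hypothetical helper: the answer does not include its own FixUri.
    // Each malformed hyperlink target is replaced by one placeholder URI so
    // that the package can be opened; the original target text is discarded.
    public static Uri FixUri(string brokenUri)
    {
        return new Uri("http://broken-link/");
    }
}
```

If the original targets matter, the callback could instead try to repair brokenUri and fall back to the placeholder only when that fails.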

Here ParseDOCX does all the conversion. The code of ParseDOCX:

private string ParseDOCX(byte[] fileInfo, string fileName) 
    { 
     try 
     { 
      //byte[] byteArray = File.ReadAllBytes(fileInfo.FullName); 
      using (MemoryStream memoryStream = new MemoryStream()) 
      { 
       memoryStream.Write(fileInfo, 0, fileInfo.Length); 

       using (WordprocessingDocument wDoc = WordprocessingDocument.Open(memoryStream, true)) 
       { 

        int imageCounter = 0; 

        var pageTitle = fileName; 
        var part = wDoc.CoreFilePropertiesPart; 
        if (part != null) 
         pageTitle = (string)part.GetXDocument().Descendants(DC.title).FirstOrDefault() ?? fileName; 

        WmlToHtmlConverterSettings settings = new WmlToHtmlConverterSettings() 
        { 
         AdditionalCss = "body { margin: 1cm auto; max-width: 20cm; padding: 0; }", 
         PageTitle = pageTitle, 
         FabricateCssClasses = true, 
         CssClassPrefix = "pt-", 
         RestrictToSupportedLanguages = false, 
         RestrictToSupportedNumberingFormats = false, 
         ImageHandler = imageInfo => 
         { 
          ++imageCounter; 
          string extension = imageInfo.ContentType.Split('/')[1].ToLower(); 
          ImageFormat imageFormat = null; 
          if (extension == "png") imageFormat = ImageFormat.Png; 
          else if (extension == "gif") imageFormat = ImageFormat.Gif; 
          else if (extension == "bmp") imageFormat = ImageFormat.Bmp; 
          else if (extension == "jpeg") imageFormat = ImageFormat.Jpeg; 
          else if (extension == "tiff") 
          { 
           extension = "gif"; 
           imageFormat = ImageFormat.Gif; 
          } 
          else if (extension == "x-wmf") 
          { 
           extension = "wmf"; 
           imageFormat = ImageFormat.Wmf; 
          } 

          if (imageFormat == null) 
           return null; 

          string base64 = null; 
          try 
          { 
           using (MemoryStream ms = new MemoryStream()) 
           { 
            imageInfo.Bitmap.Save(ms, imageFormat); 
            var ba = ms.ToArray(); 
            base64 = System.Convert.ToBase64String(ba); 
           } 
          } 
          catch (System.Runtime.InteropServices.ExternalException) 
          { return null; } 


          ImageFormat format = imageInfo.Bitmap.RawFormat; 
          ImageCodecInfo codec = ImageCodecInfo.GetImageDecoders().First(c => c.FormatID == format.Guid); 
          string mimeType = codec.MimeType; 

          string imageSource = string.Format("data:{0};base64,{1}", mimeType, base64); 

          XElement img = new XElement(Xhtml.img, 
           new XAttribute(NoNamespace.src, imageSource), 
           imageInfo.ImgStyleAttribute, 
           imageInfo.AltText != null ? 
            new XAttribute(NoNamespace.alt, imageInfo.AltText) : null); 
          return img; 
         } 

        }; 
        XElement htmlElement = WmlToHtmlConverter.ConvertToHtml(wDoc, settings); 

        var html = new XDocument(new XDocumentType("html", null, null, null), htmlElement); 
        var htmlString = html.ToString(SaveOptions.DisableFormatting); 
        return htmlString; 
       } 
      } 
     } 
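     catch (OpenXmlPackageException) 
     { 
      // Let package errors reach the caller, so that the "Invalid Hyperlink" 
      // retry in ConvertToHtml can actually run. 
      throw; 
     } 
     catch (Exception) 
     { 
      return "File contains corrupt data"; 
     } 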
    } 

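Up to this point everything looked nice and easy, but then I realized that the header and footer parts are simply skipped, so I had to convert them somehow. I tried to use the GetStream() method of HeaderPart, but of course an exception is thrown: the stream holds only the header part's XML, whose tree is not the same as the Document's, and it is not a complete docx package that WordprocessingDocument.Open can read.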
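
To see why passing the header stream to WordprocessingDocument.Open fails, it can help to read that stream as plain text. The sketch below (the class and method names are made up for illustration) returns the raw XML of the first header part, which is a single XML file rather than a zip package:

```csharp
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;

public static class HeaderInspector
{
    // Hypothetical helper: returns the raw XML of the first header part,
    // or null when the document has no header.
    public static string ReadFirstHeaderXml(byte[] docxBytes)
    {
        using (var stream = new MemoryStream(docxBytes))
        using (var doc = WordprocessingDocument.Open(stream, false))
        {
            HeaderPart header = doc.MainDocumentPart.HeaderParts.FirstOrDefault();
            if (header == null)
                return null;

            // The part stream contains only the header's XML tree,
            // not a zip package, so it cannot be opened as a document.
            using (var reader = new StreamReader(header.GetStream()))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```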

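Then I decided to build a new document from the header and footer using OpenXML's WordprocessingDocument headerDoc = WordprocessingDocument.Create(headerStream,Document) (this took a long time). Unfortunately, converting that document was just as pointless, because this only creates a bare docx document without any settings, styles, web settings, and so on. It took a lot of time to figure this out.

So I finally decided to create the new document with Cathal's DocX library, and that finally did the job. The code is as follows: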

public string ConvertHeaderToHtml(HeaderPart header) 
    { 

     using (MemoryStream headerStream = new MemoryStream()) 
     { 
      //Cathal's Docx Create 
      var newDocument = Novacode.DocX.Create(headerStream); 
      newDocument.Save(); 

      using (WordprocessingDocument headerDoc = WordprocessingDocument.Open(headerStream,true)) 
      { 
       var headerParagraphs = new List<OpenXmlElement>(header.Header.Elements()); 
       var mainPart = headerDoc.MainDocumentPart; 

       // Cloning the elements is necessary: appending them directly throws, 
       // because each element still belongs to the header's tree. 
       mainPart.Document.Body.Append(headerParagraphs.Select(h => (OpenXmlElement)h.Clone()).ToList()); 

       // Copy the header's relationships over to the new document. 
       foreach (IdPartPair parts in header.Parts) 
       { 
        // The second parameter of AddPart is very important: without it the relationship ID 
        // changes and the pictures etc. in the Word document will not show. 
        mainPart.AddPart(parts.OpenXmlPart,parts.RelationshipId); 
       } 
       } 
       headerDoc.MainDocumentPart.Document.Save(); 
       headerDoc.Save(); 
       headerDoc.Close(); 
      } 
      return ConvertToHtml(headerStream.ToArray()); 
     } 
    } 

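So that is it for the header. I pass in the HeaderPart and take its Header and the Header's elements. It is very important to also extract the relationships (needed when the header contains images) and import them into the new document itself; after that the document is ready for conversion.

Use the same steps to generate the HTML for the footer.

I hope this helps someone with their work.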
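
A sketch of those same steps for the footer, mirroring ConvertHeaderToHtml with FooterPart in place of HeaderPart. The class and method names are assumptions; the helper returns the bytes of the temporary document so that they can be passed to ConvertToHtml:

```csharp
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;

public static class FooterExport
{
    // Hypothetical helper: builds a standalone docx whose body holds the footer content.
    public static byte[] BuildFooterDocument(FooterPart footer)
    {
        using (MemoryStream footerStream = new MemoryStream())
        {
            // Create an empty docx with Cathal's DocX library, as for the header.
            var newDocument = Novacode.DocX.Create(footerStream);
            newDocument.Save();

            using (WordprocessingDocument footerDoc = WordprocessingDocument.Open(footerStream, true))
            {
                var mainPart = footerDoc.MainDocumentPart;

                // Clone each element, because the originals still belong to the footer's tree.
                mainPart.Document.Body.Append(
                    footer.Footer.Elements().Select(e => (OpenXmlElement)e.Clone()).ToList());

                // Keep the original relationship IDs so that images still resolve.
                foreach (IdPartPair pair in footer.Parts)
                {
                    mainPart.AddPart(pair.OpenXmlPart, pair.RelationshipId);
                }

                mainPart.Document.Save();
            }
            return footerStream.ToArray();
        }
    }
}
```

With the first footer taken from wDoc.MainDocumentPart.FooterParts.FirstOrDefault(), the call would then be ConvertToHtml(FooterExport.BuildFooterDocument(footer)).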

+0

I have code for creating a Word document from 3 HTML strings (HtmlBody, HtmlHeader, HtmlFooter). There are a few stumbling blocks there too; if needed, I will try to upload it.