2014-04-27 63 views

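Convert .doc to .html in Java using Apache POI

I want to convert a .doc document that contains some images. How can I convert it to .html so that the images keep the same positions? And how can I store those images in a separate folder named image and use that folder as the image source?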

My code:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.w3c.dom.Document;

public class TestWordToHtmlConverter {
    private final File docFile;

    public TestWordToHtmlConverter(File docFile) {
        this.docFile = docFile;
    }

    // Converts the .doc file to HTML and writes the result to htmlFile.
    // Images are not exported: WordToHtmlConverter skips them by default.
    public void convert(File htmlFile) {
        // try-with-resources closes the input stream even if conversion fails.
        try (FileInputStream finStream = new FileInputStream(docFile)) {
            HWPFDocument doc = new HWPFDocument(finStream);

            // Create an empty DOM and let POI fill it with the HTML tree.
            Document newDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDocument);
            wordToHtmlConverter.processDocument(doc);

            // Serialize the DOM to an HTML string.
            StringWriter stringWriter = new StringWriter();
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            transformer.transform(new DOMSource(wordToHtmlConverter.getDocument()), new StreamResult(stringWriter));

            // Write the HTML as UTF-8; the writer is closed automatically.
            try (BufferedWriter out = new BufferedWriter(
                    new OutputStreamWriter(new FileOutputStream(htmlFile), "UTF-8"))) {
                out.write(stringWriter.toString());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        TestWordToHtmlConverter converter = new TestWordToHtmlConverter(new File("docx/sample.doc"));
        converter.convert(new File("html/sample.html"));
    }
}

This implementation does not create images or links to them. This can be changed by overriding the AbstractWordConverter.processImage(Element, boolean, Picture) method.

Answer


As the API documentation says:

WordToHtmlConverter does not produce images or links to them. This can be changed by overriding the AbstractWordConverter.processImage(Element, boolean, Picture) method.
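As a rough illustration of what such an override might do, the sketch below uses only the JDK: it writes image bytes into an image folder next to the HTML file and appends an img element with a relative src to a DOM block. ImageFolderWriter, saveImage, and appendImage are hypothetical names, not part of Apache POI. In a real processImage(Element, boolean, Picture) override, the bytes and file name would presumably come from picture.getContent() and picture.suggestFullFileName(). If your POI version offers setPicturesManager on the converter, that hook may be an alternative to subclassing; check the Javadoc for your version.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper for illustration; it is not part of Apache POI.
class ImageFolderWriter {

    // Writes content to <htmlDir>/image/<fileName> and returns the
    // relative URL to use as the src attribute of an <img> element.
    static String saveImage(Path htmlDir, String fileName, byte[] content) {
        try {
            Path imageDir = htmlDir.resolve("image");
            Files.createDirectories(imageDir); // no-op if the folder already exists
            Files.write(imageDir.resolve(fileName), content);
            return "image/" + fileName;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Appends <img src="..."> to the block that should contain the picture.
    static Element appendImage(Element currentBlock, String src) {
        Element img = currentBlock.getOwnerDocument().createElement("img");
        img.setAttribute("src", src);
        currentBlock.appendChild(img);
        return img;
    }

    public static void main(String[] args) throws Exception {
        Path htmlDir = Files.createTempDirectory("html");
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element paragraph = dom.createElement("p");
        dom.appendChild(paragraph);

        // Placeholder bytes; a processImage override would pass the picture's
        // content and suggested file name instead (assumption, see lead-in).
        String src = saveImage(htmlDir, "0.png", new byte[] {1, 2, 3});
        appendImage(paragraph, src);

        // Serialize the DOM the same way the question's code does.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(dom), new StreamResult(out));
        System.out.println(out);
    }
}
```

Because the returned src is relative (image/0.png), the HTML file and the image folder can be moved together without breaking the links. Since the element is appended to the block being processed, the picture lands in the paragraph the converter is building at that point.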

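You can find out how to override the method here:

You can also try the XWPF DOCX 2 XHTML converter, which is based on Apache POI:

Apache Tika, which is built on top of Apache POI, is another option. An example that ships with Alfresco can be found here:

There are many other converters as well.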

Thanks... I have the solution now. – sudhakar810

You're welcome. –

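The images are handled now and the solution works fine. But there is a problem with bullets and numbering: paragraphs that contain lists are displayed correctly, but the numbering is missing. – sudhakar810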
