Apache POI Word to HTML conversion - word boundary

I am using the following code to convert a Word document to an HTML file:

public Map convert(String wordDocPath, String htmlPath, 
     Map conversionParams) 
{ 
    log.info("Converting word file "+wordDocPath) 
    try 
    { 
     // The backslash must be escaped; "C:\temp" would contain a tab character (\t)
     String workingFolder = "C:\\temp" 
     File workingFolderFile = new File(workingFolder) 

     FileInputStream fis = new FileInputStream(wordDocPath); 
     XWPFDocument document = new XWPFDocument(fis); 
     XHTMLOptions options = XHTMLOptions.create().URIResolver(new FileURIResolver(workingFolderFile)); 
     options.setExtractor(new FileImageExtractor(workingFolderFile)) 
     File htmlFile = new File(htmlPath); 
     OutputStream out = new FileOutputStream(htmlFile) 
     XHTMLConverter.getInstance().convert(document, out, options); 
     // Release the file handles once the conversion has finished
     out.close() 
     fis.close() 

     log.info("Converted to HTML file "+htmlPath) 

    } 
    catch(Exception e) 
    { 
     log.error("Exception :"+e.getMessage(),e) 
    } 
}

The code generates the HTML output correctly.

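I need to put some parameters in the document, such as [[AGENT_NAME]], which I will replace later in the code using a regular expression. However, Apache POI does not treat this pattern as a single word. Instead it inserts styling tags in the middle of it, between "[[", the name, and "]]". Because of this I cannot write a regular expression that matches and replaces the parameters.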
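
To illustrate the symptom, here is a minimal Java sketch that uses only `java.util.regex`. The class name `PlaceholderRegexDemo`, the method `fill`, the sample HTML strings, and the value "Alice" are all assumptions made up for this illustration; they are not part of the converter or of Apache POI. It shows a placeholder being replaced when it is contiguous, and being missed when markup sits inside it.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderRegexDemo {
    // Matches [[NAME]] where NAME consists of uppercase letters, digits, or underscores.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\[\\[([A-Z0-9_]+)\\]\\]");

    // Replaces every known [[NAME]] in the text; unknown names are left as they are.
    public static String fill(String html, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement keeps '$' and '\' in the value from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = Map.of("AGENT_NAME", "Alice");

        // Hypothetical converter output where the placeholder stayed in one piece
        String intact = "<p><span>Dear [[AGENT_NAME]],</span></p>";
        // Hypothetical converter output where the placeholder was split across spans
        String split = "<p><span>Dear [[</span><span>AGENT_NAME</span><span>]],</span></p>";

        System.out.println(fill(intact, values)); // placeholder is replaced
        System.out.println(fill(split, values));  // no match, text comes back unchanged
    }
}
```

The second call returns its input untouched because the characters between "[[" and "]]" no longer form one contiguous token. That is why the split has to be repaired before (or instead of) doing the regex replacement on the HTML.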

How does Apache POI decide word boundaries? Is there a way to control it?

2016-06-20 Fayaz

Apache POI isn't deciding the word boundaries; that will have been Microsoft Word's choice when it generated the original file... – Gagravarr

Can you explain? Any links would help. Are there special characters that form part of a word boundary? – Fayaz

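By debugging the code (XWPFDocument.paragraphs) and going through the Office Open XML documentation at http://officeopenxml.com/WPparagraph.php, I learned that MS Word can split text into separate runs at any position in the document. It can even split simple continuous text that contains no special characters (such as AGENTNAME). But can we control this behavior? How can we make the text be treated as a single run? – Fayaz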

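After all this effort, I finally decided to write code that parses the Word document and merges the split runs. The code is below; I hope it helps someone.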

Note: I use the pattern ${pattern}

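// Merges runs so that every ${...} placeholder ends up inside a single run.
// Only body paragraphs (document.paragraphs) are scanned; paragraphs inside tables are not visited.
void mergeSplittedPatterns(XWPFDocument document) 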
{ 
    List<XWPFParagraph> paragraphs = document.paragraphs 

    for(XWPFParagraph paragraph : paragraphs) 
    { 
     List<XWPFRun> runs = paragraph.getRuns() 

     int firstCharRun,closingCharRun 
     boolean firstCharFound = false; 
     boolean secondCharFoundImmediately = false; 
     boolean closingCharFound = false; 
     boolean gotoNextRun = true 

     boolean scan = (runs!=null && runs.size()>0) 
     int index = 0 

     while(scan) 
     { 
      gotoNextRun = true; 
      XWPFRun run = runs.get(index) 
      String runText = run.getText(0) 
      if(runText!=null) 
       for (int i = 0; i < runText.length(); i++) 
      { 
       char character = runText.charAt(i); 

       if(secondCharFoundImmediately) 
       { 
        closingCharFound = (character=="}") 
        if(closingCharFound) 
        { 
         closingCharRun = index 

         if(firstCharRun==closingCharRun) 
         { 
          firstCharFound = secondCharFoundImmediately = closingCharFound = false 
          continue; 
         } 
         else 
         { 
          String mergedText= "" 
          for(int j=firstCharRun;j<=closingCharRun;j++) 
          { 
           // getText(0) returns null for runs without text; skip those instead of appending "null"
           mergedText += (runs.get(j).getText(0) ?: "") 
          } 
          runs.get(firstCharRun).setText(mergedText,0) 

          for(int j=closingCharRun;j>firstCharRun;j--) 
          { 
           paragraph.removeRun(j) 
          } 
          firstCharFound = secondCharFoundImmediately = closingCharFound = gotoNextRun = false 
          index = firstCharRun 
          break; 
         } 
        } 
       } 
       else if(firstCharFound) 
       { 
        secondCharFoundImmediately = (character=="{") 
        if(!secondCharFoundImmediately) 
        { 
         firstCharFound = secondCharFoundImmediately = closingCharFound = false 
        } 
       } 
       else if(character=="\$") 
       { 
        firstCharFound = true; 
        firstCharRun = index 
       } 
      } 

      if(gotoNextRun) 
      { 
       index++; 
      } 

      if(index>=runs.size()) 
      { 
       scan = false; 
      } 
     } 
    } 
}
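
The merging idea can also be tried out without a Word file. The following is a simplified Java sketch that models each run as a plain `String`; it is not the code above and not an Apache POI API. The class `RunMergeSketch` and its methods are hypothetical names introduced for illustration. It assumes the `${...}` pattern from the note above and, like the code above, only merges when a closing `}` actually exists later in the paragraph.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RunMergeSketch {
    // True if the text contains a "${" that has no "}" after it.
    static boolean hasOpenPlaceholder(String text) {
        int open = text.lastIndexOf("${");
        return open >= 0 && text.indexOf('}', open) < 0;
    }

    // True if any run from index 'from' onwards contains a closing brace.
    static boolean closesLater(List<String> runs, int from) {
        for (int j = from; j < runs.size(); j++) {
            if (runs.get(j).indexOf('}') >= 0) {
                return true;
            }
        }
        return false;
    }

    // Returns a copy of the run texts in which every placeholder that was
    // spread over several runs is collapsed into the run where it starts.
    public static List<String> mergeSplitPlaceholders(List<String> runTexts) {
        List<String> runs = new ArrayList<>(runTexts);
        for (int i = 0; i < runs.size(); i++) {
            while (i + 1 < runs.size()) {
                String current = runs.get(i);
                String next = runs.get(i + 1);
                // The split may fall inside the name or between '$' and '{'
                boolean pending = hasOpenPlaceholder(current)
                        || (current.endsWith("$") && next.startsWith("{"));
                if (!pending || !closesLater(runs, i + 1)) {
                    break;
                }
                // Absorb the next run; later runs shift down by one position
                runs.set(i, current + runs.remove(i + 1));
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        List<String> runs = Arrays.asList("Dear ", "${", "AGENT", "_NAME", "}", ", hello");
        System.out.println(mergeSplitPlaceholders(runs));
    }
}
```

In a real document the same loop would read each run's text with `run.getText(0)` (treating `null` as empty), write the merged text back with `setText(text, 0)`, and drop the absorbed runs with `paragraph.removeRun(index)`, which are the calls the code above already uses. Note that the absorbed runs lose their own formatting, since only the first run's properties survive.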

2016-08-02 05:39:52 Fayaz
