Reading .doc files in Java using POI

Hi, I am trying to read the text from .doc and .docx files. For .doc files I am doing this:

package test;

import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class ReadFile {
    public static void main(String[] args) {
        File file = null;
        WordExtractor extractor = null;
        try {
            file = new File("C:\\Users\\rijo\\Downloads\\r.doc");
            FileInputStream fis = new FileInputStream(file.getAbsolutePath());
            HWPFDocument document = new HWPFDocument(fis);
            extractor = new WordExtractor(document);
            String fileData = extractor.getText();
            System.out.println(fileData);
        } catch (Exception exep) {
            // Print the failure instead of silently swallowing it
            exep.printStackTrace();
        }
    }
}

But this gives me an org/apache/poi/OldFileFormatException exception.

Any idea how to solve this?

Also, I need to read .docx and PDF files. Is there a good way to read all of these file types?

2013-10-14 Rijo Joseph

Which version of POI are you using? – Paolo

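If you look at the Javadoc for OldFileFormatException, you can see that it is the

base class of all the exceptions that POI throws in the event that it's given a file that is older than currently supported.

This means that the r.doc you are using is not supported by HWPFDocument. It may be that it only supports the latest format (docx has been around for quite a while now, and I am not sure whether Apache POI supports the doc format in HWPFDocument).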

2013-10-14 10:58:32 SudoRahul

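I tried with a .docx file but got the same exception. Do you know any other way to read all .doc, .docx and .pdf files? –

Using the following jars (in case the version numbers play a role here):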

dom4j-1.7-20060614 
poi-3.9-20121203 
poi-ooxml-3.9-20121203 
poi-ooxml-schemas-3.9-20121203 
poi-scratchpad-3.9-20121203 
xmlbeans-2.4.0

I put this together:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class SO {
    public static void main(String[] args) {

        // Alternate between the two to check what works.
        //String filePath = "D:\\Users\\username\\Desktop\\Doc1.docx";
        String filePath = "D:\\Users\\username\\Desktop\\Bob.doc";

        if (filePath.toLowerCase().endsWith(".docx")) { // is a docx
            // try-with-resources closes the stream even if parsing fails
            try (FileInputStream fis = new FileInputStream(new File(filePath))) {
                XWPFDocument doc = new XWPFDocument(fis);
                XWPFWordExtractor extract = new XWPFWordExtractor(doc);
                System.out.println(extract.getText());
            } catch (IOException e) {
                e.printStackTrace();
            }
        } else { // is not a docx
            try (FileInputStream fis = new FileInputStream(new File(filePath))) {
                HWPFDocument doc = new HWPFDocument(fis);
                WordExtractor extractor = new WordExtractor(doc);
                System.out.println(extractor.getText());
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

This lets me read text from both .docx and .doc files. If it does not work on your machine, you may have a problem with the external jars you are using.
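Choosing the extractor from the file name can be isolated in a small helper. The following is only a minimal sketch: WordFileKind and its Kind enum are hypothetical names, not part of POI. It compares the whole extension case-insensitively, so names such as REPORT.DOCX or notes.tex are classified sensibly, and anything that is neither .doc nor .docx is reported as UNKNOWN instead of being handed to HWPFDocument.

```java
import java.util.Locale;

// Hypothetical helper (not part of POI): decides which POI reader family
// a file name points to, based only on its extension.
final class WordFileKind {
    enum Kind { DOC, DOCX, UNKNOWN }

    static Kind fromName(String fileName) {
        // Locale.ROOT avoids locale-specific case mapping surprises
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".docx")) {
            return Kind.DOCX;    // would go to XWPFDocument / XWPFWordExtractor
        }
        if (lower.endsWith(".doc")) {
            return Kind.DOC;     // would go to HWPFDocument / WordExtractor
        }
        return Kind.UNKNOWN;     // neither extension: caller decides what to do
    }

    public static void main(String[] args) {
        System.out.println(fromName("D:\\Users\\username\\Desktop\\Bob.doc"));
        System.out.println(fromName("D:\\Users\\username\\Desktop\\Doc1.docx"));
    }
}
```

The extension only says what the file claims to be, so a renamed file will still end up in the wrong branch.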

Good luck though :)

2013-10-14 13:08:45 Levenal

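@RijoJoseph I have updated my answer based on your earlier comment. – Levenal

I don't know why you are using WordExtractor just to get the text from a .doc file. For me, a single method was enough: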

import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
...
File fin = new File(yourFilePath);
FileInputStream fis = new FileInputStream(fin);
HWPFDocument doc = new HWPFDocument(fis);
String text = doc.getDocumentText();
System.out.println(text);
...

To work with .pdf files, use another Apache project: PDFBox.
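Since .doc, .docx and .pdf each need a different library, it can help to check what a file actually contains before picking one. The following is a minimal sketch using only the JDK: FormatSniffer is a hypothetical helper, not part of POI or PDFBox. It looks at the first bytes of the file: the OLE2 compound-document signature used by binary .doc files, the ZIP local-file header used by .docx packages, and the "%PDF" marker. It assumes the signature sits at offset 0, and a ZIP or OLE2 match only narrows the candidates (other Office formats share the same containers).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper (not part of POI or PDFBox): guesses the container
// format of a file from its leading bytes.
final class FormatSniffer {
    enum Format { OLE2, ZIP, PDF, UNKNOWN }

    // OLE2 compound document: container of binary .doc (also .xls, .ppt)
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // ZIP local file header "PK\3\4": container of .docx (and any other zip)
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    // ASCII "%PDF"
    private static final int[] PDF_MAGIC = {0x25, 0x50, 0x44, 0x46};

    static Format sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Format.OLE2;
        if (startsWith(header, ZIP_MAGIC)) return Format.ZIP;
        if (startsWith(header, PDF_MAGIC)) return Format.PDF;
        return Format.UNKNOWN;
    }

    // Reads at most the first 8 bytes of the file and classifies them.
    static Format sniff(Path file) throws IOException {
        byte[] header = new byte[8];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while (filled < header.length
                    && (n = in.read(header, filled, header.length - filled)) != -1) {
                filled += n;
            }
        }
        return sniff(Arrays.copyOf(header, filled));
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false; // compare as unsigned
        }
        return true;
    }
}
```

A caller could then route OLE2 to HWPFDocument, ZIP to XWPFDocument and PDF to PDFBox, and report UNKNOWN files instead of passing them to a parser.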

2015-10-27 09:57:52
