Getting file content with the Tika 1.10 parser

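I have run into an unusual problem while trying to get the content of a file with the Tika parser. The following code works when run in a JUnit test with several types of input file (e.g. doc, docx, txt, pdf); that is, I am able to get the text content of each file. When I run the same code in my application, no text is returned. There is no exception, just an empty string from handler.toString().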

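import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.InvalidParameterException;

import org.apache.tika.exception.EncryptedDocumentException;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public static String parseFile(final String path, final int charCountLimit) {

    if (path == null) {
        throw new InvalidParameterException("parameter is null");
    }

    // -1 disables the BodyContentHandler write limit; 0 and values below -1 are rejected
    if (charCountLimit < -1 || charCountLimit == 0) {
        throw new InvalidParameterException("char count limit is out of range");
    }

    final File file = new File(path);

    if (!file.exists()) {
        throw new InvalidParameterException(String.format("file does not exist %s", path));
    }

    try (InputStream stream = new FileInputStream(file.getAbsolutePath())) {
        final AutoDetectParser parser = new AutoDetectParser();
        final BodyContentHandler handler = new BodyContentHandler(charCountLimit);

        Metadata metadata = new Metadata();
        /* the following setting is required for Office 2007 and later files,
         * despite not being specified in the Tika Parser documentation
         */
        metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());

        parser.parse(stream, handler, metadata);
        return handler.toString();

    } catch (EncryptedDocumentException e) {
        //handle exception
    } catch (IOException | SAXException | TikaException e) {
        //handle exception
    }
    // placeholder so the method compiles; the real exception handling was elided
    return null;
}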

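My first thought was that something was wrong with the files my application uses, but I have ruled this out by statically referencing one of the test case files on my file system.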

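My other thought was that I have some kind of version conflict. In my project's POM I reference version 1.10 of tika-core, but a parent POM specified version 1.8. I have changed the parent POM's reference to 1.10, but the problem persists.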

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.10</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.10</version>
</dependency>
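
One way to test the version-conflict theory at runtime is to print where a Tika class was actually loaded from, both in the JUnit run and inside the deployed application. The following is only a sketch that uses the JDK alone; the class name ClassOrigin and the helper locate are made up for illustration, and the default class name assumes tika-core is on the classpath.

```java
import java.net.URL;
import java.security.CodeSource;

/** Hypothetical helper: reports which jar or directory a class was loaded from. */
final class ClassOrigin {

    /**
     * Returns the code source location of the given class, or null when
     * the class loader does not expose one (as for JDK bootstrap classes).
     */
    static URL locate(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return source == null ? null : source.getLocation();
    }

    public static void main(String[] args) {
        // Defaults to a tika-core class; pass another class name as the first argument.
        String name = args.length > 0 ? args[0] : "org.apache.tika.parser.AutoDetectParser";
        try {
            System.out.println(name + " loaded from " + locate(Class.forName(name)));
        } catch (ClassNotFoundException e) {
            System.out.println(name + " is not visible to this class loader");
        }
    }
}
```

If the location printed inside the application server points at a different jar (or a different version) than the one printed in the unit test, the two environments are not using the same copy of Tika.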

I would appreciate any suggestions on how to resolve this.

UPDATE

Having worked through http://wiki.apache.org/tika/Troubleshooting%20Tika#No_Content_Extracted, I have worked out that all of the parsers are missing. In JUnit,

org.apache.tika.parser.DefaultParser

contains 58 parsers. When running in the application on my JBoss 8 server, DefaultParser contains no parsers. When I add the JVM argument

-Dorg.apache.tika.service.error.warn=true

no java.lang.NoClassDefFoundError is reported to indicate that a parser could not be loaded.
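
An empty DefaultParser with no load errors is consistent with the parser registrations not being visible at all, rather than failing to load. Tika's parsers are registered through META-INF/services/org.apache.tika.parser.Parser descriptor files inside the parser jars, so a quick check is to ask the class loader which of those descriptors it can see. This is a minimal sketch using only the JDK; the class name ServiceDescriptorCheck and its methods are hypothetical, and it assumes the relevant class loader is the thread context class loader.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: lists the service descriptor files a class loader can see. */
final class ServiceDescriptorCheck {

    /** Returns every META-INF/services/<serviceName> resource visible to the loader. */
    static List<URL> findDescriptors(String serviceName, ClassLoader loader) throws IOException {
        return Collections.list(loader.getResources("META-INF/services/" + serviceName));
    }

    /** Prefers the thread context class loader, falling back to the system loader. */
    static ClassLoader currentLoader() {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        return loader != null ? loader : ClassLoader.getSystemClassLoader();
    }

    public static void main(String[] args) throws IOException {
        List<URL> found = findDescriptors("org.apache.tika.parser.Parser", currentLoader());
        System.out.println(found.size() + " parser descriptor(s) visible");
        for (URL url : found) {
            System.out.println("  " + url);
        }
    }
}
```

Calling findDescriptors from inside the deployed application (for example from a startup hook) and seeing zero descriptors would suggest that no parser jar is reachable from that class loader.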

2015-08-17 Gordon

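Have you tried following the "No Content Extracted" section of the Apache Tika Troubleshooting page (http://wiki.apache.org/tika/Troubleshooting%20Tika#No_Content_Extracted)? If so, how far did you get before you hit problems, and what did you hit? – Gagravarr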

I had not come across that troubleshooting page. Thanks for the reference. I will review it. – Gordon

I fixed my problem. The issue was related to the dependencies in the EAR file that contains my "parse file" jar.

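In my EAR's POM there was already a dependency on tika-core. At runtime, the EAR's copy of tika-core was used to instantiate AutoDetectParser. Because the EAR's POM had no dependency on tika-parsers, the parser classes could not be loaded.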

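So it looks like the problem was caused by an incorrect Maven POM dependency configuration, compounded by the fact that DefaultParser (as used by AutoDetectParser) by default produces no output, and throws no exception, when it cannot find any parsers.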
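
For illustration, the fix would amount to something like the following in the EAR's POM. The EAR's POM was not shown, so this fragment is an assumption about its shape; the version is taken to match the 1.10 used elsewhere in the project.

```xml
<!-- Assumed addition to the EAR's POM: ship tika-parsers alongside tika-core -->
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.10</version>
</dependency>
```

With both artifacts packaged in the same place, the copy of tika-core that instantiates AutoDetectParser should be able to see the parser registrations.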

2015-08-19 13:17:43 Gordon
