2015-06-26 134 views
11

I am using Tika to parse large PDF and Word documents, but I get the error message below. How can I read large files with Tika?

Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available). 

How do I increase the limit?

+0

That depends entirely on how you are calling Apache Tika. How are you calling it? – Gagravarr

Answer

15

Assuming you are basically following the Tika example for extracting to plain text, all you need to do is create your BodyContentHandler with a write limit of -1 to disable the write limit, as explained in the javadocs.

Your code would then look something like this (inspired by the example):

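import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class ContentHandlerExample {
    public String parseToPlainText() throws Exception {
        // A write limit of -1 disables the cap (100000 characters in the error above).
        BodyContentHandler handler = new BodyContentHandler(-1);
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        // try-with-resources closes the stream even if parsing fails.
        try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc")) {
            parser.parse(stream, handler, metadata);
            return handler.toString();
        }
    }
}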
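
For intuition about what the write limit does, here is a minimal sketch that uses only the SAX classes shipped with the JDK. It is not Tika's actual implementation, and WriteLimitSketch, LimitedTextHandler and extract are hypothetical names made up for this illustration. The idea is that the handler collects text until the limit is reached and then throws, while -1 turns the check off:

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Simplified illustration only: this is NOT how Tika implements its limit.
class WriteLimitSketch {

    // Hypothetical handler that collects character data up to a limit.
    static class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int writeLimit; // -1 means "no limit"

        LimitedTextHandler(int writeLimit) {
            this.writeLimit = writeLimit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            if (writeLimit == -1 || text.length() + length <= writeLimit) {
                text.append(ch, start, length);
                return;
            }
            // Keep the part that still fits, then signal that the limit was hit.
            text.append(ch, start, writeLimit - text.length());
            throw new SAXException("write limit of " + writeLimit + " characters reached");
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    // Feeds the input to the handler in small chunks, roughly as a parser would.
    static String extract(String input, int writeLimit) {
        LimitedTextHandler handler = new LimitedTextHandler(writeLimit);
        char[] chars = input.toCharArray();
        try {
            for (int i = 0; i < chars.length; i += 4) {
                handler.characters(chars, i, Math.min(4, chars.length - i));
            }
        } catch (SAXException e) {
            // The text collected before the limit is still available.
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract("0123456789", 6));  // prints 012345
        System.out.println(extract("0123456789", -1)); // prints 0123456789
    }
}
```

This also matches the last sentence of the error message: when the limit is hit, the text collected up to that point is still there. So if you would rather keep a limit than disable it, catching the exception and reading the handler afterwards is another option.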