Lucene: How do I store file content?

I'm trying to index and store the content of a file (plain text), but it doesn't seem to be possible this way:

protected Document getDocument(File f) throws Exception { 
    Document doc = new Document(); 
    Field contents = new Field("contents", new FileReader(f)); 
    Field filename = new Field("filename", f.getName(), Field.Store.YES, Field.Index.ANALYZED); 
    doc.add(contents); 
    return doc; 
}

How can I store the content of a plain text file (without any tags)?


2012-10-04 gaffcz

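Just read the file content into a String and use another Field constructor (a Field built from a Reader is indexed but never stored), like: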

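protected Document getDocument(File f) throws Exception {
    Document doc = new Document();
    String text;
    // "\\A" matches the start of input, so the Scanner returns the whole file as one token
    try (Scanner scanner = new Scanner(f)) {
        text = scanner.useDelimiter("\\A").hasNext() ? scanner.next() : ""; // empty file: no token
    }
    // Stored but not indexed; use Field.Index.ANALYZED to make the content searchable as well
    Field contents = new Field("contents", text, Field.Store.YES, Field.Index.NO);
    Field filename = new Field("filename", f.getName(), Field.Store.YES, Field.Index.ANALYZED);
    doc.add(contents);
    doc.add(filename);
    return doc;
}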
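The file-reading step can also be done without a Scanner. The following is a minimal sketch, assuming Java 7 or later; `FileText` and `readAll` are hypothetical names used only for illustration and are not part of Lucene. The resulting String could then be passed to the Field constructor shown above.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not a Lucene class): reads a whole text file into a String.
class FileText {

    static String readAll(Path file, Charset charset) throws IOException {
        // readAllBytes returns an empty array for an empty file,
        // so an empty file simply yields an empty String.
        return new String(Files.readAllBytes(file), charset);
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary file, read it back, and clean up.
        Path tmp = Files.createTempFile("lucene-demo", ".txt");
        try {
            Files.write(tmp, "first line\nsecond line\n".getBytes(StandardCharsets.UTF_8));
            String text = readAll(tmp, StandardCharsets.UTF_8);
            System.out.println(text.length() + " characters read");
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Passing the charset explicitly avoids depending on the platform default, which `new Scanner(f)` uses when no charset is given.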


2012-10-05 16:51:03 mindas

Thanks, it works! – gaffcz

Take a look at Apache Tika (http://tika.apache.org/). It is a good library for extracting text from HTML and other structured documents.

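As for storing it in the Lucene index: depending on your needs, you can strip the tags before storing the text. Alternatively, you can build an analyzer around it so that the tags are handled at index time.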
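As a rough illustration of the "strip the tags before storing" option, here is a minimal sketch using only a regular expression. It is a crude stand-in for a real extractor such as Tika, not a replacement: it does not decode entities, skip script or comment bodies, or cope with `>` inside attribute values. `TagStripper` and `stripTags` are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical helper: naive tag removal, shown only to illustrate
// cleaning text before it is put into a stored field.
class TagStripper {

    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripTags(String markup) {
        // Replace each tag with a space so adjacent words do not run together,
        // then collapse runs of whitespace and trim the ends.
        String withoutTags = TAG.matcher(markup).replaceAll(" ");
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Stored <b>plain</b> text</p>"));
        // prints: Stored plain text
    }
}
```

The cleaned String would then be stored the same way as any other plain text.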


2012-10-04 13:22:08 jcern

Thanks, I'll give it a try! – gaffcz
