Reducing the size of the index files

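OK, after a lot of confusion about Lucene indexing, I tried a program that indexes every file in a folder. It was able to index 76 files, 131 MB in all, mainly ppt, pdf, and doc files. The index is about 80 MB; indexing finished in 36 seconds, and a search query takes 7 milliseconds.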

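Is that fast?
Is the size of the index normal?
Is there any way to reduce the size of the index?
Every time I add new files, I have to run the indexer program again. Is there an automatic way to index whenever a new file is added?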

Here is my indexer; it is actually taken from the Lucene in Action book.

package lia.meetlucene;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;
import java.io.File;
import java.io.IOException;
import java.io.FileReader;

public class Indexer {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            throw new IllegalArgumentException("Usage: java " + Indexer.class.getName()
                + " <index dir> <data dir>");
        }
        String indexDir = args[0];   //1
        String dataDir = args[1];    //2

        long start = System.currentTimeMillis();
        Indexer indexer = new Indexer(indexDir);
        int numIndexed;
        try {
            numIndexed = indexer.index(dataDir);
        } finally {
            indexer.close();
        }
        long end = System.currentTimeMillis();

        System.out.println("Indexing " + numIndexed + " files took "
            + (end - start) + " milliseconds");
    }

    private IndexWriter writer;

    public Indexer(String indexDir) throws IOException {
        Directory dir = FSDirectory.open(new File(indexDir));
        writer = new IndexWriter(dir,                  //3
            new StandardAnalyzer(Version.LUCENE_30),   //3
            true,                                      //3
            IndexWriter.MaxFieldLength.UNLIMITED);     //3
    }

    public void close() throws IOException {
        writer.close();                                //4
    }

    // Recursively indexes every readable, non-hidden file under dataDir.
    public int index(String dataDir) throws Exception {
        try {
            File[] files = new File(dataDir).listFiles();
            if (files == null) {
                // dataDir is not a directory, or it could not be listed
                return writer.numDocs();
            }
            for (File f : files) {
                if (f.isDirectory()) {
                    index(f.getAbsolutePath());
                } else if (!f.isHidden() && f.exists() && f.canRead()) {
                    indexFile(f);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return writer.numDocs();                       //5
    }

    protected Document getDocument(File f) throws Exception {
        Document doc = new Document();
        doc.add(new Field("contents", new FileReader(f)));      //7
        doc.add(new Field("filename", f.getName(),              //8
            Field.Store.YES, Field.Index.NOT_ANALYZED));        //8
        doc.add(new Field("fullpath", f.getCanonicalPath(),     //9
            Field.Store.YES, Field.Index.NOT_ANALYZED));        //9
        return doc;
    }

    private void indexFile(File f) throws Exception {
        System.out.println("Indexing " + f.getCanonicalPath());
        Document doc = getDocument(f);
        writer.addDocument(doc);                       //10
    }
}

Source

2014-02-12 samnaction

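A few comments on your code:

doc.add(new Field("contents", new FileReader(f))); //7 Are you sure this is right? If your files are binary (ppt, pdf ...), then you are indexing the raw bytes here. You should look at a text extraction tool such as Tika; that will greatly reduce your index size.
Also verify that your index is using the compound file format, which would make it smaller.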
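
As a rough illustration of the text-extraction suggestion, the sketch below swaps the `FileReader` on the line marked //7 for a reader obtained from the Tika facade. It assumes the Tika jars (core plus parsers) are on the classpath next to Lucene 3.0; the class name `TikaDocumentBuilder` is made up for this example, and this is only one possible way to wire the two libraries together.

```java
import java.io.File;
import java.io.Reader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.tika.Tika;

/** Sketch: build a Lucene document from extracted text instead of raw bytes. */
public class TikaDocumentBuilder {

    // Tika facade with its default configuration; it detects the file
    // type and picks a matching parser (assumes the parser jars are present).
    private final Tika tika = new Tika();

    public Document getDocument(File f) throws Exception {
        // Reader over the extracted plain text of the PDF, PPT, DOC, ...
        Reader text = tika.parse(f);

        Document doc = new Document();
        // Reader-valued fields are tokenized and indexed but not stored,
        // same as the original "contents" field.
        doc.add(new Field("contents", text));
        doc.add(new Field("filename", f.getName(),
            Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("fullpath", f.getCanonicalPath(),
            Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }
}
```

With something like this in place, `Indexer.getDocument` could delegate to `TikaDocumentBuilder.getDocument`. How much the index shrinks depends on the documents, so it is worth re-measuring after the change.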

Source

2014-02-12 08:22:22 Persimmonium

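Actually, the original code was written to index text files. I removed the filter so that it indexes every file format. Was that wrong? – samnaction

Yes, that is wrong: you are using the actual bytes of the PDF file rather than its *text content*. You need to extract the text. – Persimmonium

Does Tika need XML files in order to index files? – samnaction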
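
To make the bytes-versus-text distinction concrete, here is a minimal sketch that looks at the first bytes of a file and guesses whether it is plain text or a binary container that needs text extraction first. The class `FormatSniffer`, its labels, and the 512-byte window are all assumptions made for illustration; it uses only the JDK and is far cruder than the detection a tool like Tika performs.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

/** Sketch: guess from the leading bytes whether a file needs text extraction. */
public class FormatSniffer {

    /** Returns a coarse label based on well-known magic bytes. */
    public static String sniff(byte[] head) {
        if (startsWith(head, new byte[] {'%', 'P', 'D', 'F'})) {
            return "pdf";
        }
        // OLE2 compound document: legacy .ppt, .doc, .xls
        if (startsWith(head, new byte[] {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0})) {
            return "ole2";
        }
        // ZIP container: .pptx, .docx, .xlsx, and ordinary zip archives
        if (startsWith(head, new byte[] {'P', 'K', 0x03, 0x04})) {
            return "zip";
        }
        for (byte b : head) {
            if (b == 0) {
                return "binary";   // NUL bytes are rare in ASCII/UTF-8 text
            }
        }
        return "text";
    }

    /** Reads up to 512 leading bytes of the file and classifies them. */
    public static String sniff(Path file) throws IOException {
        byte[] buf = new byte[512];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (n < buf.length) {
                int r = in.read(buf, n, buf.length - n);
                if (r < 0) {
                    break;         // end of file
                }
                n += r;
            }
        }
        return sniff(Arrays.copyOf(buf, n));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            System.out.println(arg + ": " + sniff(Paths.get(arg)));
        }
    }
}
```

Only files labelled "text" are reasonable candidates for a plain `FileReader`; anything else would go through an extractor before being handed to Lucene.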
