Lucene index written one document at a time slows down over time

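We have a program that runs continuously, does various things, and changes some records in our database. Those records are indexed using Lucene. So each time we change an entity, we do something like this:

Open a db transaction, open a Lucene IndexWriter
Make the changes to the db in the transaction, and update that entity in Lucene by using indexWriter.deleteDocuments(..) then indexWriter.addDocument(..)
If all went well, commit the db transaction and commit the IndexWriter.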

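This works fine, but over time indexWriter.commit() takes more and more time. Initially it takes about 0.5 seconds, but after a few hundred such transactions it takes more than 3 seconds. I don't doubt it would take even longer if the script ran longer.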

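My solution so far has been to comment out indexWriter.addDocument(..) and indexWriter.commit(), and to recreate the entire index every now and again by first calling indexWriter.deleteAll() and then re-adding all documents within one Lucene transaction/IndexWriter (about 250k documents in about 14 seconds). But this obviously goes against the transactional approach offered by databases and Lucene, which keeps the two in sync and keeps database updates visible to users of our tools who search using Lucene.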

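It seems strange that I can add 250k documents in 14 seconds, but adding 1 document takes 3 seconds. What am I doing wrong, and how can I improve the situation?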

Source

2015-08-28 Adrian Smith

Can you solve it with a background task? You might take a 10-second penalty, but that is acceptable for many applications – AdamSkywalker

@AdamSkywalker - But it keeps getting slower. At what point does it take 1 hour, 10 hours or 2 days? –

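What you are doing wrong is assuming that Lucene's built-in transactional capabilities have performance and guarantees comparable to a typical relational database, when they really don't. More specifically, in your case a commit syncs all index files with the disk, which makes commit time proportional to index size. That is why your indexWriter.commit() takes more and more time. The Javadoc for IndexWriter.commit() even warns: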

This may be a costly operation, so you should test the cost in your application and do it only when really necessary.

Can you imagine database documentation telling you to avoid doing commits?
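To act on the Javadoc's advice about testing the cost, a small timing wrapper can be enough. The following is only a minimal sketch: CommitTimer and IoAction are hypothetical names, not Lucene classes, and Thread.sleep stands in for a real commit so the snippet runs without Lucene on the classpath.

```java
import java.util.concurrent.TimeUnit;

/** Minimal timing helper. All names here are hypothetical and not part of Lucene. */
final class CommitTimer {

    /** Stand-in for any operation that may throw, for example a commit call. */
    interface IoAction {
        void run() throws Exception;
    }

    /** Runs the action and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(IoAction action) {
        long start = System.nanoTime();
        try {
            action.run();
        } catch (Exception e) {
            // Rethrow unchecked so callers do not need a throws clause.
            throw new RuntimeException(e);
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Thread.sleep is an assumption of this sketch; it simulates a slow commit.
        long elapsed = timeMillis(() -> Thread.sleep(50));
        System.out.println("simulated commit took " + elapsed + " ms");
    }
}
```

In a real application the lambda would presumably be replaced by something like indexWriter::commit, and logging the result every so often should show whether commit time really grows with the index.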

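Since your main goal seems to be keeping database updates visible through Lucene searches in a timely manner, do the following to improve the situation: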

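Have indexWriter.deleteDocuments(..) and indexWriter.addDocument(..) trigger after a successful database commit, rather than before
Perform indexWriter.commit() periodically instead of on every transaction, just to make sure your changes are eventually written to disk
Use a SearcherManager for searching and invoke maybeRefresh() periodically to see updated documents within a reasonable time frame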
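The first point above can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: PostCommitIndexing, DbTransaction and IndexUpdate are hypothetical stand-ins, with IndexUpdate representing the indexWriter.deleteDocuments(..) plus indexWriter.addDocument(..) pair, so the snippet runs without Lucene or a database.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: queue index changes and apply them only after the database commit succeeds. */
final class PostCommitIndexing {

    /** Hypothetical stand-in for a database transaction. */
    interface DbTransaction {
        void commit();   // assumed to throw an unchecked exception on failure
        void rollback();
    }

    /** Hypothetical stand-in for a delete-then-add against the IndexWriter. */
    interface IndexUpdate {
        void apply();
    }

    private final List<IndexUpdate> pending = new ArrayList<>();

    /** Record an index change while the database transaction is still open. */
    void enqueue(IndexUpdate update) {
        pending.add(update);
    }

    /** Commits the database first and touches the index only if that succeeded. */
    boolean commit(DbTransaction tx) {
        try {
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback();
            pending.clear(); // the database change was rejected, so drop the index changes too
            return false;
        }
        for (IndexUpdate update : pending) {
            update.apply(); // note: no indexWriter.commit() here, that happens periodically
        }
        pending.clear();
        return true;
    }

    public static void main(String[] args) {
        PostCommitIndexing unit = new PostCommitIndexing();
        unit.enqueue(() -> System.out.println("index updated for entity 42"));
        DbTransaction tx = new DbTransaction() {
            public void commit() { System.out.println("db committed"); }
            public void rollback() { System.out.println("db rolled back"); }
        };
        unit.commit(tx);
    }
}
```

One trade-off of this ordering: if applying an index update fails after the database has committed, the index lags behind the database until the change is retried or the index is rebuilt.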

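The following example program demonstrates how to pick up document updates by calling maybeRefresh() periodically. It builds an index of 100000 documents, uses a ScheduledExecutorService to set up periodic invocations of commit() and maybeRefresh(), prompts you to update a single document, and then searches repeatedly until the update is visible. All resources are properly cleaned up when the program terminates. Note that the controlling factor for when the update becomes visible is when maybeRefresh() is invoked, not commit().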

import java.io.IOException; 
import java.nio.file.Paths; 
import java.util.Scanner; 
import java.util.concurrent.*; 
import org.apache.lucene.analysis.standard.StandardAnalyzer; 
import org.apache.lucene.document.*; 
import org.apache.lucene.index.*; 
import org.apache.lucene.search.*; 
import org.apache.lucene.store.FSDirectory; 

public class LucenePeriodicCommitRefreshExample { 
    ScheduledExecutorService scheduledExecutor; 
    MyIndexer indexer; 
    MySearcher searcher; 

    void init() throws IOException { 
     scheduledExecutor = Executors.newScheduledThreadPool(3); 
     indexer = new MyIndexer(); 
     indexer.init(); 
     searcher = new MySearcher(indexer.indexWriter); 
     searcher.init(); 
    } 

    void destroy() throws IOException { 
     searcher.destroy(); 
     indexer.destroy(); 
     scheduledExecutor.shutdown(); 
    } 

    class MyIndexer { 
     IndexWriter indexWriter; 
     Future<?> commitFuture; // handle used to cancel the periodic commit

     void init() throws IOException { 
      indexWriter = new IndexWriter(FSDirectory.open(Paths.get("C:\\Temp\\lucene-example")), new IndexWriterConfig(new StandardAnalyzer())); 
      indexWriter.deleteAll(); 
      for (int i = 1; i <= 100000; i++) { 
       add(String.valueOf(i), "whatever " + i); 
      } 
      indexWriter.commit(); 
      commitFuture = scheduledExecutor.scheduleWithFixedDelay(() -> { 
       try { 
        indexWriter.commit(); 
       } catch (IOException e) { 
        e.printStackTrace(); 
       } 
      }, 5, 5, TimeUnit.MINUTES); 
     } 

     void add(String id, String text) throws IOException { 
      Document doc = new Document(); 
      doc.add(new StringField("id", id, Field.Store.YES)); 
      doc.add(new StringField("text", text, Field.Store.YES)); 
      indexWriter.addDocument(doc); 
     } 

     void update(String id, String text) throws IOException { 
      indexWriter.deleteDocuments(new Term("id", id)); 
      add(id, text); 
     } 

     void destroy() throws IOException { 
      commitFuture.cancel(false); 
      indexWriter.close(); 
     } 
    } 

    class MySearcher { 
     IndexWriter indexWriter; 
     SearcherManager searcherManager; 
     Future<?> maybeRefreshFuture; // handle used to cancel the periodic refresh

     public MySearcher(IndexWriter indexWriter) { 
      this.indexWriter = indexWriter; 
     } 

     void init() throws IOException { 
      searcherManager = new SearcherManager(indexWriter, true, null); 
      maybeRefreshFuture = scheduledExecutor.scheduleWithFixedDelay(() -> { 
       try { 
        searcherManager.maybeRefresh(); 
       } catch (IOException e) { 
        e.printStackTrace(); 
       } 
      }, 0, 5, TimeUnit.SECONDS); 
     } 

     String findText(String id) throws IOException { 
      IndexSearcher searcher = null; 
      try { 
       searcher = searcherManager.acquire(); 
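       TopDocs topDocs = searcher.search(new TermQuery(new Term("id", id)), 1);
       // update() deletes and re-adds in two steps, so a refresh that lands between them finds no match
       if (topDocs.scoreDocs.length == 0) {
        return null;
       }
       return searcher.doc(topDocs.scoreDocs[0].doc).getField("text").stringValue();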
      } finally { 
       if (searcher != null) { 
        searcherManager.release(searcher); 
       } 
      } 
     } 

     void destroy() throws IOException { 
      maybeRefreshFuture.cancel(false); 
      searcherManager.close(); 
     } 
    } 

    public static void main(String[] args) throws IOException { 
     LucenePeriodicCommitRefreshExample example = new LucenePeriodicCommitRefreshExample(); 
     example.init(); 
     Runtime.getRuntime().addShutdownHook(new Thread() { 
      @Override 
      public void run() { 
       try { 
        example.destroy(); 
       } catch (IOException e) { 
        e.printStackTrace(); 
       } 
      } 
     }); 

     try (Scanner scanner = new Scanner(System.in)) { 
      System.out.print("Enter a document id to update (from 1 to 100000): "); 
      String id = scanner.nextLine(); 
      System.out.print("Enter what you want the document text to be: "); 
      String text = scanner.nextLine(); 
      example.indexer.update(id, text); 
      long startTime = System.nanoTime(); 
      String foundText; 
      do { 
       foundText = example.searcher.findText(id); 
      } while (!text.equals(foundText)); 
      long elapsedTimeMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startTime); 
      System.out.format("it took %d milliseconds for the searcher to see that document %s is now '%s'\n", elapsedTimeMillis, id, text); 
     } catch (Exception e) { 
      e.printStackTrace(); 
     } finally { 
      System.exit(0); 
     } 
    } 
}

This example was successfully tested with Lucene 5.3.1 and JDK 1.8.0_66.

Source

2015-12-06 10:51:00 heenenee

My first approach: don't commit that often. When you delete and re-add a document, you will probably trigger a merge. Merges are somewhat slow.

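If you use a near-real-time IndexReader, you can still search as you did before (it does not show the deleted documents), but you do not pay the commit penalty. You can always commit later to make sure the file system is in sync with your index. You can do this while the index is in use, so you don't have to block all other operations.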
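The "commit later" idea can be sketched as a committer that only marks the index dirty on each update and lets a background task do the actual commit. This is a hedged sketch, not part of Lucene: DeferredCommitter and Committable are hypothetical names, and Committable stands in for the IndexWriter so the snippet runs on its own.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Sketch: coalesce many updates into one later commit. Hypothetical names throughout. */
final class DeferredCommitter {

    /** Hypothetical stand-in for an IndexWriter; only the commit call matters here. */
    interface Committable {
        void commit() throws Exception;
    }

    private final Committable target;
    private final AtomicBoolean dirty = new AtomicBoolean(false);

    DeferredCommitter(Committable target) {
        this.target = target;
    }

    /** Called after every index update; cheap, so the updating thread never waits on a commit. */
    void markDirty() {
        dirty.set(true);
    }

    /**
     * Meant to be called from a background thread. Commits only if something changed
     * since the last successful commit, and returns whether a commit was performed.
     */
    boolean commitIfDirty() {
        if (!dirty.compareAndSet(true, false)) {
            return false; // nothing to do
        }
        try {
            target.commit();
            return true;
        } catch (Exception e) {
            dirty.set(true); // keep the flag so the next run retries
            return false;
        }
    }

    public static void main(String[] args) {
        DeferredCommitter committer = new DeferredCommitter(() -> System.out.println("commit"));
        committer.markDirty();
        committer.markDirty();
        committer.commitIfDirty(); // performs a single commit for both updates
        committer.commitIfDirty(); // does nothing, the index is clean
    }
}
```

The method swallows the exception on purpose: a task passed to ScheduledExecutorService.scheduleWithFixedDelay stops being rescheduled once it throws, so a real version would probably log the failure and rely on the retry.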

See also this interesting blog post (and do read the other posts there, they provide great information).

Source

2015-08-28 11:23:54 RobAu

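I can understand that triggering a merge might be slow, but would you expect the commit to get slower over time? (If I commit less often, that only delays the point at which commits become slow (1 minute? 10 minutes?), and since this script runs forever, it will eventually reach that point.) –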

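I have worked with indexes of about 10M documents. On my laptop, a commit() could take 10 seconds. But if you use the 'NRT' solution, that doesn't matter, because you don't have to wait for the commit() to finish. – RobAu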
