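Is it possible to iterate over the documents stored in a Lucene index?

I have some documents stored in a Lucene index with a docId field. I want to get all the docIds stored in the index. There is one more issue: there are about 300,000 documents, so I would prefer to retrieve the docIds in chunks of 500. Is that possible?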

2010-02-22 Eugeniu Torica

IndexReader reader = // create IndexReader
for (int i = 0; i < reader.maxDoc(); i++) {
    // Skip documents that have been marked as deleted.
    if (reader.isDeleted(i))
        continue;

    Document doc = reader.document(i);
    String docId = doc.get("docId");

    // do something with docId here...
}

2010-02-23 21:15:28 bajafresh4life

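What happens if the reader.isDeleted(i) check is missing? – 2010-02-24 16:16:36

Without the isDeleted() check, you would output the IDs of documents that were previously deleted. – bajafresh4life 2010-02-25 03:34:51

To complete the comment above: index changes are committed when the index is reopened, so reader.isDeleted(i) is necessary to make sure the document is valid. – 2011-02-24 11:29:05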

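Document numbers (or ids) are consecutive numbers from 0 to IndexReader.maxDoc() - 1. These numbers are not persistent and are valid only for the currently open IndexReader. You can check whether a document has been deleted with the IndexReader.isDeleted(int documentNumber) method.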
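Since document numbers run from 0 to maxDoc() - 1, the chunking the question asks for can be done by walking that range and grouping the live document numbers. The following is only a minimal sketch in plain Java, not Lucene code: the class DocBatcher, the method liveDocBatches and the isDeleted predicate are hypothetical names, and the predicate stands in for a call such as reader.isDeleted(i).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical helper: groups live document numbers into fixed-size batches. */
final class DocBatcher {

    /**
     * @param maxDoc    one greater than the largest document number (what reader.maxDoc() returns)
     * @param isDeleted assumption of this sketch: a stand-in for reader.isDeleted(i)
     * @param batchSize maximum number of document numbers per batch, e.g. 500
     */
    static List<List<Integer>> liveDocBatches(int maxDoc, IntPredicate isDeleted, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<Integer>> batches = new ArrayList<>();
        List<Integer> current = new ArrayList<>(batchSize);
        for (int i = 0; i < maxDoc; i++) {
            if (isDeleted.test(i)) {
                continue; // deleted documents never count toward a batch
            }
            current.add(i);
            if (current.size() == batchSize) {
                batches.add(current);
                current = new ArrayList<>(batchSize);
            }
        }
        if (!current.isEmpty()) {
            batches.add(current); // last, possibly shorter, batch
        }
        return batches;
    }
}
```

With a real reader, each batch could then be loaded with reader.document(n) for every n in the batch. Because document numbers are only valid for the open reader, all batches should be processed against the same IndexReader instance.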

2010-02-22 19:09:38 Yaroslav

Lucene 4

// liveDocs is null when the index has no deletions.
Bits liveDocs = MultiFields.getLiveDocs(reader);
for (int i = 0; i < reader.maxDoc(); i++) {
    // Skip deleted documents.
    if (liveDocs != null && !liveDocs.get(i))
        continue;

    Document doc = reader.document(i);
}

For details, see LUCENE-2600 on this page: https://lucene.apache.org/core/4_0_0/MIGRATE.html

2013-08-28 22:45:07 bcoughlan

This was rolled back by another user, but the original editor was correct: liveDocs can be null. – bcoughlan 2013-11-01 15:24:49

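If you use .document(i) as in the examples above and skip deleted documents, be careful when using this method to paginate results. For example: you list 10 documents per page and need the documents for page 6. Your input might look like this: offset = 60, count = 10 (documents 60 to 69).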

IndexReader reader = // create IndexReader
for (int i = offset; i < offset + 10; i++) {
    if (reader.isDeleted(i))
        continue;

    Document doc = reader.document(i);
    String docId = doc.get("docId");
}

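You will run into problems with deleted documents, because you should not start at offset = 60 but at offset = 60 + the number of deleted documents that appear before 60. The loop above also returns fewer than 10 documents whenever some of the documents in the range are deleted.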
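One way to make the offset count only live documents is to scan from document 0 and skip deleted entries before the page starts. This is a minimal sketch in plain Java rather than Lucene code: LivePager, page and the isDeleted predicate are hypothetical names, with the predicate standing in for reader.isDeleted(i).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical helper: pages over live documents only. */
final class LivePager {

    /**
     * Returns the document numbers of the live documents at positions
     * [offset, offset + count) counted among live documents only.
     *
     * @param maxDoc    what reader.maxDoc() returns
     * @param isDeleted assumption of this sketch: a stand-in for reader.isDeleted(i)
     */
    static List<Integer> page(int maxDoc, IntPredicate isDeleted, int offset, int count) {
        List<Integer> result = new ArrayList<>();
        int liveSeen = 0; // live documents encountered so far
        for (int i = 0; i < maxDoc && result.size() < count; i++) {
            if (isDeleted.test(i)) {
                continue; // deleted documents do not consume an offset slot
            }
            if (liveSeen >= offset) {
                result.add(i);
            }
            liveSeen++;
        }
        return result;
    }
}
```

The trade-off is that every page request scans from the start of the index, so the cost grows with the offset.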

An alternative I found is something like this:

IndexSearcher is = getIndexSearcher(); // new IndexSearcher(indexReader)
// Get all results without any conditions attached.
Term term = new Term([[any mandatory field name]], "*");
Query query = new WildcardQuery(term);

TopScoreDocCollector topCollector = TopScoreDocCollector.create([[int max hits to get]], true);
is.search(query, topCollector);

// Return only the requested page of hits.
TopDocs topDocs = topCollector.topDocs(offset, count);

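Note: replace the text between [[ ]] with your own values. I ran this on a large index with 1.5 million entries and got 10 random results in less than a second. Agreed, it is slower, but if you need pagination, at least you can ignore deleted documents.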

2015-04-30 08:53:04 andreyro

There is also a query class named MatchAllDocsQuery, which I think can be used in this case:

Query query = new MatchAllDocsQuery();
TopDocs topDocs = getIndexSearcher().search(query, RESULT_LIMIT);

2016-01-21 08:05:01
