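Incremental indexing in Lucene

I am building an application in Java using Lucene 3.6 and want to make the indexing incremental. I have already created the index, and I have read that what you have to do is open the existing index and compare each document's indexed modification date with the file's modification date to see whether they differ; if they do, delete the indexed document and re-add it. My problem is that I don't know how to do this in Java with Lucene.

Thanks

My code is: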

public static void main(String[] args) 
    throws CorruptIndexException, LockObtainFailedException, 
      IOException { 

    File docDir = new File("D:\\PRUEBASLUCENE"); 
    File indexDir = new File("C:\\PRUEBA"); 

    Directory fsDir = FSDirectory.open(indexDir); 
    Analyzer an = new StandardAnalyzer(Version.LUCENE_36); 
    IndexWriter indexWriter 
     = new IndexWriter(fsDir,an,MaxFieldLength.UNLIMITED); 


    long numChars = 0L; 
    for (File f : docDir.listFiles()) { 
     String fileName = f.getName(); 
     Document d = new Document(); 
     d.add(new Field("Name",fileName, 
         Store.YES,Index.NOT_ANALYZED)); 
     d.add(new Field("Path",f.getPath(),Store.YES,Index.ANALYZED)); 
     long tamano = f.length(); 
     d.add(new Field("Size",""+tamano,Store.YES,Index.ANALYZED)); 
     long fechalong = f.lastModified(); 
     d.add(new Field("Modification_Date",""+fechalong,Store.YES,Index.ANALYZED)); 
     indexWriter.addDocument(d); 
    } 

    // Read the count before closing: numDocs() cannot be called on a closed writer.
    int numDocs = indexWriter.numDocs(); 
    indexWriter.optimize(); 
    indexWriter.close(); 

    System.out.println("Index Directory=" + indexDir.getCanonicalPath()); 
    System.out.println("Doc Directory=" + docDir.getCanonicalPath()); 
    System.out.println("num docs=" + numDocs); 
    System.out.println("num chars=" + numChars);

}

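Thanks Edmondo1984, you helped me a lot.

In the end I wrote the code shown below. It stores a hash of the file and then checks the modification date.

Indexing 9300 files takes 15 seconds, and re-indexing (with nothing to index, because no file has changed) also takes 15 seconds. Am I doing something wrong, or can I optimize the code to reduce this?

Thanks jtahlborn, by doing what you suggested with the IndexReader I managed to bring the create and update times level. But shouldn't updating an existing index be faster than recreating it? Is it possible to optimize the code further?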

if(IndexReader.indexExists(dir)) 
      { 
       //reader is a IndexReader and is passed as parameter to the function 
       //searcher is a IndexSearcher and is passed as parameter to the function 
       term = new Term("Hash",String.valueOf(file.hashCode())); 
       Query termQuery = new TermQuery(term); 
       TopDocs topDocs = searcher.search(termQuery,1); 
       if(topDocs.totalHits==1) 
       { 
        Document doc; 
        int docId,comparedate; 
        docId=topDocs.scoreDocs[0].doc; 
        doc=reader.document(docId); 
        String dateIndString=doc.get("Modification_date"); 
        long dateIndLong=Long.parseLong(dateIndString); 
        Date date_ind=new Date(dateIndLong); 
        String dateFichString=DateTools.timeToString(file.lastModified(), DateTools.Resolution.MINUTE); 
        long dateFichLong=Long.parseLong(dateFichString); 
        Date date_fich=new Date(dateFichLong); 
        //Compare the two dates 
        comparedate=date_fich.compareTo(date_ind); 
        if(comparedate>=0) 
        { 
         if(comparedate==0) 
         { 
          //If the comparison is 0, do nothing 
          flag=2; 
         } 
         else 
         { 
          //If the comparison is > 0, updateDocument 
          flag=1; 
         } 
        }

Source

2012-07-12 Jose Luis Vázquez López

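Have you looked at the Javadoc? What exactly don't you understand? Please be more specific... – Edmondo1984 2012-07-12 11:06:57

Sorry, the code has now been added – 2012-07-13 09:57:43

That code is a collection of fragments from which you won't get what you want. You would be better off learning how Lucene works and writing it from scratch – Edmondo1984 2012-07-16 06:08:38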

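According to the Lucene data model, you store documents in an index. Inside each document you have the fields you want to index (the so-called "analyzed" fields) and fields that are not "analyzed", where you can store a timestamp and other information you may need later.

I have the feeling that you are confusing files with documents, because in your first post you talk about documents and now you are trying to call IndexFileNames.isDocStoreFile(file.getName()), which actually only tells you whether a file is one that contains a Lucene index.

If you understand Lucene's object model, writing the code you need takes about three minutes:

You have to check whether the document already exists in the index (for example by storing a non-analyzed field containing a unique identifier), simply by querying Lucene.
If your query returns 0 documents, you add the new document to the index.
If your query returns 1 document, you get its "timestamp" field and compare it with that of the new document you are trying to store. You can then use the document's docId to delete it from the index and, if necessary, add the new document.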
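The three steps above boil down to a small decision per file. The following is only a minimal sketch of that decision in plain Java, with no Lucene calls: the class `IncrementalDecision`, the `Action` enum and the `decide` method are hypothetical names invented for illustration, and the map in `main` merely stands in for "query the index by unique identifier". It also assumes that any difference between the stored and the current timestamp counts as a change.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: decides what to do with one file
// given the timestamp already stored in the index (null if the query had no hit).
class IncrementalDecision {

    enum Action { ADD, UPDATE, SKIP }

    static Action decide(Long indexedTimestamp, long fileTimestamp) {
        if (indexedTimestamp == null) {
            return Action.ADD;      // query returned 0 documents
        }
        if (indexedTimestamp.longValue() != fileTimestamp) {
            return Action.UPDATE;   // timestamps differ: delete and re-add
        }
        return Action.SKIP;         // timestamps match: leave the index alone
    }

    public static void main(String[] args) {
        // Stand-in for the index: unique id (here the file name) -> stored timestamp.
        Map<String, Long> indexed = new HashMap<String, Long>();
        indexed.put("a.txt", 1000L);
        indexed.put("b.txt", 1000L);

        System.out.println(decide(indexed.get("a.txt"), 1000L)); // unchanged file
        System.out.println(decide(indexed.get("b.txt"), 2000L)); // modified file
        System.out.println(decide(indexed.get("c.txt"), 500L));  // not indexed yet
    }
}
```

In a real Lucene 3.6 program the `ADD` case would correspond to `IndexWriter.addDocument(Document)` and the `UPDATE` case to `IndexWriter.updateDocument(Term, Document)`; the lookup itself would be a `TermQuery` on the non-analyzed identifier field.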

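If, on the other hand, you are sure you always want to overwrite the previous value, you can refer to this snippet from Lucene in Action: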

public void testUpdate() throws IOException { 
    assertEquals(1, getHitCount("city", "Amsterdam")); 
    IndexWriter writer = getWriter(); 
    Document doc = new Document(); 
    doc.add(new Field("id", "1", 
    Field.Store.YES, 
    Field.Index.NOT_ANALYZED)); 
    doc.add(new Field("country", "Netherlands", 
    Field.Store.YES, 
    Field.Index.NO)); 
    doc.add(new Field("contents", 
    "Den Haag has a lot of museums", 
    Field.Store.NO, 
    Field.Index.ANALYZED)); 
    doc.add(new Field("city", "Den Haag", 
    Field.Store.YES, 
    Field.Index.ANALYZED)); 
    writer.updateDocument(new Term("id", "1"), 
    doc); 
    writer.close(); 
    assertEquals(0, getHitCount("city", "Amsterdam")); 
    assertEquals(1, getHitCount("city", "Den Haag")); 
}

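As you can see, the snippet uses a non-analyzed ID, as I suggested, to store a simple queryable attribute, and the method updateDocument first deletes the document and then re-adds it.

You may want to check the Javadoc directly at

http://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/index/IndexWriter.html#updateDocument(org.apache.lucene.index.Term,org.apache.lucene.document.Document)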

Source

2012-07-17 06:48:00 Edmondo1984
