2016-09-01 77 views
1

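Duplicate documents in Lucene.Net index

I am indexing my PDF files with Lucene.Net. After I refresh the index, every document shows up multiple times in the search results (as many times as I have refreshed the index).

I am using the latest version of Lucene.Net (3.0.3).

Here is my indexing code.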

public void refreshIndexes() 
    { 
     // Create Index Writer 
     string strIndexDir = @"Z:\Munavvar\LuceneTest\index"; 
     IndexWriter writer = new IndexWriter(Lucene.Net.Store.FSDirectory.Open(new System.IO.DirectoryInfo(strIndexDir)), new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29), true, IndexWriter.MaxFieldLength.UNLIMITED); 

     writer.DeleteAll(); 
     // Find all files in the root folder and index them 
     List<string> lstFiles = searchFiles(@"Z:\Munavvar\LuceneTest\PDFs"); 
     foreach (string strFile in lstFiles) 
     { 
      Document doc = new Document(); 
      string FileName = System.IO.Path.GetFileNameWithoutExtension(strFile); 
      string Text = ExtractTextFromPdf(strFile); 
      string Path = strFile; 
      string ModifiedDate = Convert.ToString(File.GetLastWriteTime(strFile)); 
      string DocumentType = string.Empty; 
      string Vault = string.Empty; 

      string headerText = Text.Substring(0, Text.Length < 150 ? Text.Length : 150); 
      foreach (var docs in ltDocumentTypes) 
      { 
       if (headerText.ToUpper().Contains(docs.searchText.ToUpper())) 
       { 
        DocumentType = docs.DocumentType; 
        Vault = docs.VaultName; 
       } 
      } 

      if (string.IsNullOrEmpty(DocumentType)) 
      { 
       DocumentType = "Default"; 
       Vault = "Default"; 
      } 

      doc.Add(new Field("filename", FileName, Field.Store.YES, Field.Index.ANALYZED)); 
      doc.Add(new Field("text", Text, Field.Store.YES, Field.Index.ANALYZED)); 
      doc.Add(new Field("path", Path, Field.Store.YES, Field.Index.NOT_ANALYZED)); 
      doc.Add(new Field("modifieddate", ModifiedDate, Field.Store.YES, Field.Index.ANALYZED)); 
      doc.Add(new Field("documenttype", DocumentType, Field.Store.YES, Field.Index.ANALYZED)); 
      doc.Add(new Field("vault", Vault, Field.Store.YES, Field.Index.ANALYZED)); 

      writer.AddDocument(doc); 
     } 
     writer.Optimize(); 
     writer.Dispose(); 
    } 
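
For comparison, here is a minimal sketch of a refresh that cannot leave duplicates behind, assuming the stored "path" value identifies each file uniquely. The class `IndexRefreshSketch`, the method `Refresh`, and the path/text pairs passed in are hypothetical and stand in for `searchFiles` and `ExtractTextFromPdf`; only the Lucene.Net 3.0.3 calls are real. It wraps the writer in `using` so it is closed even if indexing throws, and it uses `UpdateDocument` keyed on "path" so a path that appears twice in one run replaces the earlier copy instead of being added again.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;

// Hypothetical helper, not part of the original post.
public static class IndexRefreshSketch
{
    // 'files' holds (path, already extracted text) pairs.
    // Returns the number of live documents in the index after the refresh.
    public static int Refresh(Lucene.Net.Store.Directory indexDir,
                              IEnumerable<KeyValuePair<string, string>> files)
    {
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

        // create: true starts from an empty index on every refresh.
        using (var writer = new IndexWriter(indexDir, analyzer, true,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            foreach (var file in files)
            {
                var doc = new Document();
                // NOT_ANALYZED keeps the path as a single term so it can act as a key.
                doc.Add(new Field("path", file.Key, Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.Add(new Field("text", file.Value, Field.Store.YES, Field.Index.ANALYZED));

                // Deletes any document with the same "path" term, then adds this one.
                writer.UpdateDocument(new Term("path", file.Key), doc);
            }

            writer.Commit();
            return writer.NumDocs();
        }
    }
}
```

Calling `Refresh` repeatedly with the same file list should leave the document count unchanged. If it still grows, the list of files being passed in is the more likely culprit than the writer.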

Here is my index searching code.

public List<IndexDocument> searchFromIndexes(string searchText) 
    { 
     try 
     { 
      #region search in indexes and fill list 
      // Create list 
      List<IndexDocument> searchResult = new List<IndexDocument>(); 

      if (!string.IsNullOrEmpty(searchText)) 
      { 
       string strIndexDir = @"Z:\Munavvar\LuceneTest\index"; 
       var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30); 
       IndexSearcher searcher = new IndexSearcher(Lucene.Net.Store.FSDirectory.Open(new System.IO.DirectoryInfo(strIndexDir))); 

       // parse the query, "text" is the default field to search 
       Lucene.Net.QueryParsers.QueryParser parser = new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_29, "text", analyzer); 


       Query query = parser.Parse(searchText); 

       // search 
       TopDocs hits = searcher.Search(query, searcher.MaxDoc); 

       // showing first TotalHits results 
       for (int i = 0; i < hits.TotalHits; i++) 
       { 
        // get the document from index 
        Document doc = searcher.Doc(hits.ScoreDocs[i].Doc); 

        // create a new row with the result data 
        searchResult.Add(new IndexDocument() 
         { 
          FileName = doc.Get("filename"), 
          Text = doc.Get("text"), 
          Path = doc.Get("path"), 
          ModifiedDate = doc.Get("modifieddate"), 
          Vault = doc.Get("vault"), 
          DocumentType = doc.Get("documenttype"), 
         }); 

       } 
       searcher.Dispose(); 
      } 
      return searchResult; 
      #endregion 

     } 
     catch (Exception ex) 
     { 
      throw; // rethrow without resetting the stack trace 
     } 
    } 

UPDATE

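I have a button in the window that calls the refreshIndexes method.

The old index does get cleared when I close the application, run it again, and click that button.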

+0

That is because of the third parameter: you need to delete the previous data before creating the new documents. By the way, what is 'ltDocumentTypes'? –

+1

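@RichaGarg - The third parameter of 'IndexWriter' specifies whether an existing index (if any) should be overwritten or appended to. It is 'true', so the old index *should* be deleted. – femtoRgon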

+1

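Can you provide some information on how you are searching this index? One thing I wonder is whether old readers might be left open somewhere, although it seems you would have to be collecting them in a 'MultiReader' for that... – femtoRgon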

Answer

0

I came up with a solution.

Problem: I was calling the refreshIndexes method on a global (class-level) object.

VaultIndexes vIndexes = new VaultIndexes(); 
private void btnRefreshIndex_Click(object sender, RoutedEventArgs e) 
{ 
    vIndexes.refreshIndexes(); 
} 

Solution: create a new object on every click.

private void btnRefreshIndex_Click(object sender, RoutedEventArgs e) 
{ 
    VaultIndexes vIndexes = new VaultIndexes(); 
    vIndexes.refreshIndexes(); 
} 

I don't know why the global class object produces duplicate documents.

It cannot be what @RichaGarg states in the comments, judging by the IndexWriter call:

IndexWriter writer = new IndexWriter(Lucene.Net.Store.FSDirectory.Open(new System.IO.DirectoryInfo(strIndexDir)), new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29), true, IndexWriter.MaxFieldLength.UNLIMITED);
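
One possible explanation for why a fresh object fixes it, offered purely as an assumption because the bodies of `searchFiles` and `VaultIndexes` are not shown: if the file list is an instance field that gets appended to and never cleared, a long-lived object hands back every file once per refresh, which would match the "duplicates = number of refreshes" symptom. The class `FileCollector` and both methods below are hypothetical and only illustrate that pattern.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for VaultIndexes; the real searchFiles is not shown in the post.
public class FileCollector
{
    // Instance field: its contents survive between calls on the same object.
    private readonly List<string> _found = new List<string>();

    // Leaky variant: appends to the field and never clears it,
    // so each call returns everything found by all earlier calls too.
    public List<string> SearchFilesLeaky(IEnumerable<string> paths)
    {
        _found.AddRange(paths);
        return _found;
    }

    // Safe variant: builds a new list on every call.
    public List<string> SearchFilesFresh(IEnumerable<string> paths)
    {
        return new List<string>(paths);
    }
}
```

With the leaky variant, a second call on the same object returns each path twice, while a newly constructed object starts empty again. If this is what is happening, clearing the list (or using a local one) inside `searchFiles` would let the class-level `vIndexes` object be kept.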