How can I improve Lucene.net indexing speed?

I am using Lucene.net to index my PDF files. It takes about 40 minutes to index 15,000 PDFs, and the indexing time grows as the number of PDF files in my folder increases.

How can I improve the indexing speed in Lucene.net?
Is there any other indexing service with fast indexing performance?

I am using the latest version of Lucene.net (Lucene.net 3.0.3).

Here is my indexing code.

public void refreshIndexes() 
     { 
      // Create Index Writer 
      string strIndexDir = @"E:\LuceneTest\index"; 
      IndexWriter writer = new IndexWriter(Lucene.Net.Store.FSDirectory.Open(new System.IO.DirectoryInfo(strIndexDir)), new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29), true, IndexWriter.MaxFieldLength.UNLIMITED); 

      // Find all files in root folder create index on them 
      List<string> lstFiles = searchFiles(@"E:\LuceneTest\PDFs"); 
      foreach (string strFile in lstFiles) 
      { 
       Document doc = new Document(); 
       string FileName = System.IO.Path.GetFileNameWithoutExtension(strFile); 
       string Text = ExtractTextFromPdf(strFile); 
       string Path = strFile; 
       string ModifiedDate = Convert.ToString(File.GetLastWriteTime(strFile)); 
       string DocumentType = string.Empty; 
       string Vault = string.Empty; 

       string headerText = Text.Substring(0, Text.Length < 150 ? Text.Length : 150); 
       foreach (var docs in ltDocumentTypes) 
       { 
        if (headerText.ToUpper().Contains(docs.searchText.ToUpper())) 
        { 
         DocumentType = docs.DocumentType; 
         Vault = docs.VaultName; 
        } 
       } 

       if (string.IsNullOrEmpty(DocumentType)) 
       { 
        DocumentType = "Default"; 
        Vault = "Default"; 
       } 

       doc.Add(new Field("filename", FileName, Field.Store.YES, Field.Index.ANALYZED)); 
       doc.Add(new Field("text", Text, Field.Store.YES, Field.Index.ANALYZED)); 
       doc.Add(new Field("path", Path, Field.Store.YES, Field.Index.NOT_ANALYZED)); 
       doc.Add(new Field("modifieddate", ModifiedDate, Field.Store.YES, Field.Index.ANALYZED)); 
       doc.Add(new Field("documenttype", DocumentType, Field.Store.YES, Field.Index.ANALYZED)); 
       doc.Add(new Field("vault", Vault, Field.Store.YES, Field.Index.ANALYZED)); 

       writer.AddDocument(doc); 
      } 
      writer.Optimize(); 
      writer.Dispose(); 
     }

Source

2016-07-30 Munavvar

Do you really need to call 'writer.Optimize()'? Isn't 'writer.Commit()' enough? – sisve

Thanks for the reply, @SimonSvensson. Optimize() is not required. I tried with Commit() instead, but performance did not improve. – Munavvar

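@Munavvar, before proposing any changes: have you tried adding some benchmarks around the relevant methods? I'd be particularly interested in the searchFiles and ExtractTextFromPdf methods. I believe the problem is probably in the latter, since your code looks fine (apart from the date, which shouldn't be analyzed). Also, how large are your PDF files? You could limit indexing and analysis to a relevant number of characters. – AR1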

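The indexing part looks fine. Note that IndexWriter is thread-safe, so if you are on a multi-core machine, using Parallel.ForEach (with MaxDegreeOfParallelism set to the number of cores; experiment with this value) will probably help.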
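A minimal sketch of the Parallel.ForEach idea. The ParallelIndexing class and the indexOne callback are hypothetical names, not part of Lucene.net; in the question's code the callback would extract the PDF text, build the Document, and call writer.AddDocument(doc) on the single shared IndexWriter. Whether ExtractTextFromPdf itself is safe to call from several threads is not shown in the question and would need checking.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParallelIndexing
{
    // Runs indexOne once per file, using at most maxWorkers threads at a time.
    // indexOne is a hypothetical callback standing in for "extract text,
    // build the Document, call writer.AddDocument(doc)".
    public static void IndexAll(IEnumerable<string> files, Action<string> indexOne, int maxWorkers)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxWorkers };
        Parallel.ForEach(files, options, indexOne);
    }
}
```

It could be called as ParallelIndexing.IndexAll(lstFiles, IndexOneFile, Environment.ProcessorCount), where IndexOneFile is an assumed method holding the body of the original foreach loop. Everything that callback shares between iterations has to be thread-safe.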

But the document type detection part is making the GC go crazy. All those ToUpper() calls are painful.

Outside the lstFiles loop, create an upper-case copy of the searchText values in ltDocumentTypes:

var upperDocTypes = ltDocumentTypes.Select(x=>x.searchText.ToUpper()).ToList();

Outside the document types loop, create another string:
```
string headerTextUpper = headerText.ToUpper(); 
```

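Then "break" when you find a match. This exits the loop as soon as a match is found and skips all the remaining iterations. Of course, this means the first match wins, whereas your code keeps the last match (if that makes a difference to you).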

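string headerText = Text.Substring(0, Text.Length < 150 ? Text.Length : 150); 
for (int i = 0; i < upperDocTypes.Count; i++) 
{ 
    // upperDocTypes[i] is the upper-cased searchText of ltDocumentTypes[i]
    // (assumes ltDocumentTypes is an indexable list in the same order)
    if (headerTextUpper.Contains(upperDocTypes[i])) 
    { 
     DocumentType = ltDocumentTypes[i].DocumentType; 
     Vault = ltDocumentTypes[i].VaultName; 
     break; 
    } 
}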

Depending on the size of ltDocumentTypes, this may not give you much of an improvement.
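Putting the pieces above together, the detection step might look roughly like the sketch below. The DocTypeRule and DocTypeMatcher classes are hypothetical; the question does not show the element type of ltDocumentTypes, so the member names are only assumed from how the loop uses them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of one entry in ltDocumentTypes (not shown in the question).
public class DocTypeRule
{
    public string searchText;
    public string DocumentType;
    public string VaultName;
}

public class DocTypeMatcher
{
    // Each rule paired with its search text, upper-cased once up front.
    private readonly List<KeyValuePair<string, DocTypeRule>> _rules;

    public DocTypeMatcher(IEnumerable<DocTypeRule> rules)
    {
        _rules = rules
            .Select(r => new KeyValuePair<string, DocTypeRule>(r.searchText.ToUpper(), r))
            .ToList();
    }

    // Returns the first rule whose search text occurs in the first
    // 150 characters of text, or null when nothing matches.
    public DocTypeRule Match(string text)
    {
        // One Substring and one ToUpper per document, instead of two
        // ToUpper calls per rule per document.
        string header = text.Substring(0, Math.Min(text.Length, 150)).ToUpper();
        foreach (var rule in _rules)
        {
            if (header.Contains(rule.Key))
                return rule.Value; // first match wins
        }
        return null;
    }
}
```

The matcher would be built once before the lstFiles loop and Match called once per file; a null result corresponds to the "Default" document type and vault in the question's code.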

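I'd bet the most expensive part is ExtractTextFromPdf. Running the code through a profiler, or instrumenting it with a few Stopwatches, will show you where the cost really is.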
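One way to do the Stopwatch measurement is a small helper that accumulates the time spent in each stage across all files. This is only a sketch; StageTimer and Measure are hypothetical names.

```csharp
using System;
using System.Diagnostics;

public static class StageTimer
{
    // Runs work, adds its duration to watch, and returns its result.
    // Stopwatch.Start resumes without resetting, so calling this
    // repeatedly with the same watch accumulates a total for the stage.
    public static T Measure<T>(Stopwatch watch, Func<T> work)
    {
        watch.Start();
        try
        {
            return work();
        }
        finally
        {
            watch.Stop();
        }
    }
}
```

With one Stopwatch per stage, the loop body would become something like string Text = StageTimer.Measure(extractWatch, () => ExtractTextFromPdf(strFile)); and after the loop extractWatch.Elapsed can be compared with the total for the AddDocument calls. Stopwatch is not thread-safe, so this is meant for the single-threaded loop.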

Source

2017-01-16 17:35:56 AndyPook
