2012-11-12

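Lucene.Net multiline regex search

We are using Lucene.Net 3.0.3 with the WhitespaceAnalyzer. We index the script under the same field name twice, once as ANALYZED and once as NOT_ANALYZED, as shown below.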

    public static void WriteIndexes()
    {
        string indexPathRegex = ConfigurationManager.TfSettings.Application.CustomSettings["dbScritpsAddressRegex"];

        var analyzerRegex = new WhitespaceAnalyzer();
        var indexWriterRegex = new IndexWriter(indexPathRegex, analyzerRegex, IndexWriter.MaxFieldLength.UNLIMITED);

        foreach (LuceneIndex l in Indexes)
        {
            var doc = new Document();
            doc.Add(new Field("ServerName", l.ServerName.ToLowerInvariant(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
            doc.Add(new Field("DatabaseName", l.DatabaseName.ToLowerInvariant(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
            doc.Add(new Field("SchemaName", l.SchemaName.ToLowerInvariant(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
            doc.Add(new Field("ObjectType", l.ObjectType.ToLowerInvariant(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
            doc.Add(new Field("ObjectName", l.ObjectName.ToLowerInvariant(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));

            // "Script" is added twice: once tokenized by the analyzer,
            // and once as a single untokenized term for the regex search.
            doc.Add(new Field("Script", l.Script, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
            doc.Add(new Field("Script", l.Script, Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO));

            indexWriterRegex.AddDocument(doc);
        }

        indexWriterRegex.Optimize();
        analyzerRegex.Close();
        indexWriterRegex.Close();
    }

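As the code shows, the Script field is indexed with both analysis options. When we search with a multiline regex, the search works if the document being searched is smaller than 16 KB. When it is larger than 16 KB, Lucene does not find the search keyword. Is this a bug? How can we solve this problem?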

Sample keyword: .*taxId.*\n.*customerNo.*

    public List<LuceneIndex> SearchAllScriptInIndex()
    {
        string indexPathRegex = ConfigurationManager.TfSettings.Application.CustomSettings["dbScritpsAddressRegex"];
        var searcher = new Lucene.Net.Search.IndexSearcher(indexPathRegex, false);

        const int hitsLimit = 1000000;
        var analyzer = new WhitespaceAnalyzer();

        var parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_29, new[] { "Script", "DatabaseName", "ObjectType", "ServerName" }, analyzer);

        // Regex query against the "Script" field.
        Term t = new Term("Script", Expression);
        RegexQuery scriptQuery = new RegexQuery(t);

        // Parsed query that restricts the server/database and object type.
        string s = string.Format("({0}) AND {1}", serverAndDatabasescript, objectTypeScript);
        var query = parser.Parse(s);

        // Both the parsed query and the regex query must match.
        BooleanQuery booleanQuery = new BooleanQuery();
        booleanQuery.Add(query, BooleanClause.Occur.MUST);
        booleanQuery.Add(scriptQuery, BooleanClause.Occur.MUST);

        var hits = searcher.Search(booleanQuery, null, hitsLimit, Sort.RELEVANCE).ScoreDocs;

        // The result list holds LuceneIndex items, matching what is added below.
        List<LuceneIndex> results = new List<LuceneIndex>();
        List<string> values = new List<string>();
        Dictionary<int, string> newLineIndices = new Dictionary<int, string>();
        foreach (var hit in hits)
        {
            var hitDocument = searcher.Doc(hit.Doc);
            string contentValue = hitDocument.Get("Script");
            LuceneIndex item = new LuceneIndex();
            item.ServerName = hitDocument.Get("ServerName");
            item.DatabaseName = hitDocument.Get("DatabaseName");
            item.ObjectName = hitDocument.Get("ObjectName");
            item.ObjectType = hitDocument.Get("ObjectType");
            item.SchemaName = hitDocument.Get("SchemaName");
            item.Script = hitDocument.Get("Script");
            results.Add(item);
        }
        return results;
    }

Answer

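The maximum supported term length is 16,383 characters, according to the documentation for IndexWriter.AddDocument and for the field IndexWriter.MAX_TERM_LENGTH. Because a NOT_ANALYZED field is indexed as a single term, any script longer than that produces one oversized term. It looks like terms longer than this are ignored, which would cause the problem you describe.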
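As a rough illustration of that limit, the sketch below checks whether a script would fit into a single term before it is indexed. The class and method names are hypothetical, and the constant is hard-coded from the documentation quoted in this answer; with Lucene.Net referenced, IndexWriter.MAX_TERM_LENGTH could presumably be used instead. The boundary follows the wording of the documentation and has not been checked against the implementation, so treat values right at the limit with caution.

```csharp
using System;

public static class TermLengthCheck
{
    // Limit in characters, taken from the IndexWriter documentation quoted
    // in this answer. Assumption: IndexWriter.MAX_TERM_LENGTH holds the
    // same value when Lucene.Net is referenced.
    public const int MaxTermLength = 16383;

    // Hypothetical helper: returns true if the value is short enough to be
    // indexed as one NOT_ANALYZED term according to the documented limit.
    public static bool FitsInSingleTerm(string value)
    {
        if (value == null) throw new ArgumentNullException("value");
        return value.Length <= MaxTermLength;
    }
}
```

Calling TermLengthCheck.FitsInSingleTerm(l.Script) inside the indexing loop would make it possible to log or handle the scripts that would otherwise be skipped silently.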

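The documentation for AddDocument states that an exception is thrown, whereas the documentation for the field only mentions that a message is written to infoStream (if set).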

/// <p/>Note that each term in the document can be no longer 
/// than 16383 characters, otherwise an 
/// IllegalArgumentException will be thrown.<p/> 

// [...] 

/// <summary> Absolute hard maximum length for a term. If a term 
/// arrives from the analyzer longer than this length, it 
/// is skipped and a message is printed to infoStream, if 
/// set (see <see cref="SetInfoStream" />). 
/// </summary> 
public static readonly int MAX_TERM_LENGTH; 

Source: IndexWriter.cs
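One possible workaround, which is not part of the documentation quoted here, is to narrow the candidates with the other fields and then apply the regular expression to the stored Script value in memory. The sketch below assumes the scripts have already been read back from the index (the question stores them with Field.Store.YES). The helper name is hypothetical, and it uses .NET's Regex rather than Lucene's RegexQuery, so matching semantics may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ScriptRegexFilter
{
    // Hypothetical helper: returns the scripts that contain a match for
    // the given pattern, regardless of how long each script is.
    public static List<string> FilterByRegex(IEnumerable<string> scripts, string pattern)
    {
        // In .NET, '.' does not match '\n' by default, so the explicit \n
        // in the sample pattern is what lets a match span two lines.
        var regex = new Regex(pattern);
        var matches = new List<string>();
        foreach (string script in scripts)
        {
            if (script != null && regex.IsMatch(script))
            {
                matches.Add(script);
            }
        }
        return matches;
    }
}
```

With the sample keyword from the question, the call would look like ScriptRegexFilter.FilterByRegex(scripts, @".*taxId.*\n.*customerNo.*"). The trade-off is that the regex runs over every candidate document, so it helps to keep the candidate set small.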