2011-07-14 45 views
0

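In my project I implemented full-text index search with Lucene. While doing so, I got stuck on the logic for telling Lucene's boolean operators apart from the ordinary words and/or/not. How do I distinguish the Lucene.Net boolean operators AND, OR, NOT from the ordinary words "and", "or", "not" in a search term?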

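Suppose, for example, we search for "I want a pen and pencil". By default Lucene.Net combines terms with the OR operator, so it searches for something like "I OR want OR a OR pen OR pencil", rather than what I want, which is "I OR want OR a OR pen OR and OR pencil". So how can we distinguish an ordinary and, or, not from the Lucene operators?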

To handle this, I wrote a helper method. It looks like this:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

/// <summary>
/// Method to get search predicates
/// </summary>
/// <param name="searchTerm">Search term</param>
/// <returns>List of predicates: exact, AND, OR, NOT</returns>
public static IList<string> GetPredicates(string searchTerm)
{
    //// Remove unwanted characters
    //searchTerm = Regex.Replace(searchTerm, "[<(.|\n)*?!'`>]", string.Empty);

    //// Exact search term: the whole input as a quoted phrase
    string exactSearchTerm = "\"" + searchTerm.Trim() + "\"";

    //// Search term without the and / or / and not keywords
    string searchTermWithOutKeywords = Regex.Replace(
        searchTerm, " and not | and | or ", " ", RegexOptions.IgnoreCase);

    //// Split keywords
    string[] splittedKeywords = searchTermWithOutKeywords.Trim().Split(
        new char[] { ' ', ',' }, StringSplitOptions.RemoveEmptyEntries);

    //// OR search term
    string keywordOrSearchTerm = string.Join(" OR ", splittedKeywords);

    //// AND search term
    string andSearchTerm = string.Join(" AND ", splittedKeywords);

    //// NOT search term: the text after each " and not ",
    //// cut off at the next " and " or " or "
    List<string> searchTerms = Regex.Split(searchTerm, " and not ", RegexOptions.IgnoreCase)
        .Skip(1)
        .Select(term => Regex.Split(term, " and | or ", RegexOptions.IgnoreCase).First())
        .ToList();
    string notSearchTerm = searchTerms.Count > 0 ? string.Join(" , ", searchTerms) : "\"\"";

    return new List<string> { exactSearchTerm, andSearchTerm, keywordOrSearchTerm, notSearchTerm };
}

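This helper method returns four results, though, so I have to run through my index four times, which is very cumbersome. Can anyone help me solve this in a single pass?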

Answers

1

The built-in StandardAnalyzer will exclude common words for you. For an explanation, see this article.

+0

Good suggestion. +1

0

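As @Matt Warren suggested, Lucene has so-called "stop words", which usually add little to search quality but make the index large and bloated. Stop words such as "a", "and", "or", "an" are normally filtered out automatically when your text is indexed, and then filtered out of the query when it is parsed. StopFilter is responsible for this behavior in both cases, but you can choose an analyzer that does not use StopFilter.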
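To make the effect concrete, here is a minimal sketch of what stop-word filtering does to the query from the question. It is plain C# and does not call Lucene.Net; the `StopWordDemo` class, the `RemoveStopWords` helper, and the short word list are assumptions made up for illustration, not Lucene's StopFilter or its actual stop-word set.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWordDemo
{
    // Illustrative subset only (assumption); not Lucene's real stop-word list.
    private static readonly HashSet<string> StopWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "a", "an", "and", "or", "not", "the"
        };

    // Hypothetical helper: splits on spaces and drops every token
    // that appears in the stop-word set (case-insensitive).
    public static IEnumerable<string> RemoveStopWords(string text)
    {
        return text
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(token => !StopWords.Contains(token));
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" ", RemoveStopWords("I want a pen and pencil")));
        // prints: I want pen pencil
    }
}
```

If the same filtering is applied at index time and at query time, words like "and" never reach the index or the query, so they cannot be matched either way. An analyzer without stop-word filtering would keep them as ordinary terms.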

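The other issue is query parsing. If I remember correctly, the Lucene query parser only treats the uppercase words OR, AND, NOT as keywords, so if the user types them in all caps, you need to replace them with lowercase so that they are not treated as operators. Here is some Regex.Replace code for that: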

string queryString = "the red pencil and blue pencil are both not green or brown"; 
queryString = 
    Regex.Replace (
     queryString, 
     @"\b(?:OR|AND|NOT)\b", 
     m => m.Value.ToLowerInvariant()); 
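The sample string above contains no uppercase operators, so the replacement leaves it unchanged. As a small sketch of the same regex on input that does contain them, the snippet below wraps it in a helper; the `QueryText` class and the `NeutralizeOperators` name are hypothetical, chosen only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QueryText
{
    // Hypothetical helper: lowercases standalone uppercase AND / OR / NOT
    // so they read as ordinary words rather than operator keywords.
    public static string NeutralizeOperators(string userInput)
    {
        return Regex.Replace(
            userInput,
            @"\b(?:OR|AND|NOT)\b",
            m => m.Value.ToLowerInvariant());
    }

    public static void Main()
    {
        Console.WriteLine(NeutralizeOperators("pen AND pencil OR NOT eraser"));
        // prints: pen and pencil or not eraser

        // The \b word boundaries leave longer words that merely contain
        // AND / OR / NOT untouched.
        Console.WriteLine(NeutralizeOperators("ANDROID NOTE"));
        // prints: ANDROID NOTE
    }
}
```

The lowercased string can then be handed to the query parser, where "and", "or", "not" are either matched as plain terms or dropped as stop words, depending on the analyzer.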