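2010-04-26

Make dtSearch highlight one hit per phrase, rather than one per word-in-a-phrase

I'm using dtSearch to highlight text search matches within a document. The code to do this, minus some details and cleanup, is roughly along these lines: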

SearchJob sj = new SearchJob(); 
sj.Request = "\"audit trail\""; // the user query 
sj.FoldersToSearch.Add(path_to_src_document); 
sj.Execute(); 
FileConverter fileConverter = new FileConverter(); 
fileConverter.SetInputItem(sj.Results, 0); 
fileConverter.BeforeHit = "<a name=\"HH_%%ThisHit%%\"/><b>"; 
fileConverter.AfterHit = "</b>"; 
fileConverter.Execute(); 
string myHighlightedDoc = fileConverter.OutputString; 

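If I give dtSearch a quoted phrase query like

"audit trail"

then dtSearch will do hit highlighting like this:

An <a name="HH_0"/><b>audit</b> <a name="HH_1"/><b>trail</b> is a fun thing to have an <a name="HH_2"/><b>audit</b> <a name="HH_last"/><b>trail</b> about!

Note that each word of the phrase is highlighted separately. Instead, I would like the phrase to be highlighted as a whole unit, like this:

An <a name="HH_0"/><b>audit trail</b> is a fun thing to have an <a name="HH_last"/><b>audit trail</b> about!

This would a) make the highlighting look better, b) improve the behavior of my javascript that helps users navigate from hit to hit, and c) give a more accurate count of the total number of hits.

Is there a good way to make dtSearch highlight phrases this way?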

Answer

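Note: I think the text and code here could use some more work. If people want to help revise the answer or the code, this can probably become community wiki.

I asked dtSearch about this (4/26/2010). Their response had two parts:

First, it is not possible to get the desired highlighting behavior just by changing a flag.

Second, it is possible to get some lower-level hit information in which phrase matches are treated as wholes. In particular, if you set both the dtsSearchWantHitsByWord and dtsSearchWantHitsArray flags in your SearchJob, then your search results will be annotated with the word offsets at which each word or phrase in your query matches. For example, if your input document is

An audit trail is a fun thing to have an audit trail about!

and your query is

"audit trail"

then (in the .NET API), sj.Results.CurrentItem.HitsByWord[0] will contain a string like this:

audit trail (2 11)

indicating that the phrase "audit trail" is found starting at the 2nd word and at the 11th word in the document.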
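To make the meaning of those offsets concrete, here is a minimal sketch that reproduces them without dtSearch. The class WordOffsetDemo and the method FindPhraseOffsets are hypothetical names made up for this illustration, and the sketch assumes whitespace-separated words with trailing punctuation trimmed. dtSearch's own word breaker may count words differently on real documents.

```csharp
using System;
using System.Collections.Generic;

public static class WordOffsetDemo
{
    // Returns the 1-based word positions at which `phrase` starts in `text`.
    // Assumption: words are split on whitespace and compared case-insensitively
    // after trimming a few punctuation characters. This is NOT dtSearch's algorithm.
    public static List<int> FindPhraseOffsets(string text, string phrase)
    {
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        string[] target = phrase.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        var offsets = new List<int>();
        if (target.Length == 0)
            return offsets;

        for (int i = 0; i + target.Length <= words.Length; i++)
        {
            bool match = true;
            for (int j = 0; j < target.Length && match; j++)
            {
                string word = words[i + j].Trim('!', '.', ',', '?');
                match = string.Equals(word, target[j], StringComparison.OrdinalIgnoreCase);
            }
            if (match)
                offsets.Add(i + 1); // word positions are counted from 1
        }
        return offsets;
    }
}
```

Called with the example document and the phrase audit trail, this returns the positions 2 and 11, matching the numbers in the parenthesized list.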

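One thing you can do with this information is create a "skip list" indicating which of the dtSearch highlights are insignificant (i.e. which are phrase continuations, rather than the start of a word or phrase). For example, if your skip list was [4, 7, 9], that might mean that the 4th, 7th, and 9th hits were insignificant, while the other hits were legitimate. A skip list of this sort could be used in at least two ways:

  1. You can change the code that navigates from hit to hit, so that it skips over hit number i if and only if skipList.contains(i).
  2. Depending on your requirements, you may also be able to rewrite the HTML generated by the dtSearch FileConverter. In my case I have dtSearch annotate hits with something like <a name="HH_1"/><span class="highlight">hitword</span>, and I use the A tags (and the fact that they are numbered sequentially: HH_1, HH_2, HH_3, etc.) as the basis of hit navigation. So what I have tried, with some success, is to walk the HTML and strip out all the A tags where the i in HH_i is contained in the skip list. Depending on your navigation code, you will probably need to renumber the A tags so that you don't have any gaps between, say, HH_1 and HH_3.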
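Both uses can be sketched in a few lines. This is only an illustration under stated assumptions, not the code I actually run: SkipListDemo, NextHit, and StripSkippedAnchors are hypothetical names, the anchors are assumed to look exactly like <a name="HH_i"/> with a numeric i, and the anchor numbers are assumed to use the same numbering as the skip list entries (counting from 0 here). A non-numeric anchor such as HH_last is left untouched.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SkipListDemo
{
    // Use 1: returns the next hit number after `current` that is not in the
    // skip list, or -1 when no such hit exists. Hits run from 0 to totalHits - 1.
    public static int NextHit(int current, int totalHits, ICollection<int> skipList)
    {
        for (int i = current + 1; i < totalHits; i++)
        {
            if (!skipList.Contains(i))
                return i;
        }
        return -1;
    }

    // Use 2: removes the anchors whose number is in the skip list and
    // renumbers the surviving anchors from 0 so that there are no gaps.
    // Assumption: anchors are written exactly as <a name="HH_<digits>"/>.
    public static string StripSkippedAnchors(string html, ICollection<int> skipList)
    {
        int next = 0; // number assigned to the next surviving anchor
        return Regex.Replace(html, "<a name=\"HH_(\\d+)\"/>", m =>
        {
            int hit = int.Parse(m.Groups[1].Value);
            if (skipList.Contains(hit))
                return string.Empty;
            return "<a name=\"HH_" + (next++) + "\"/>";
        });
    }
}
```

Note that stripping the anchors only affects navigation and hit counting. The highlight markup around each word stays in place, so the words of a phrase are still wrapped separately.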

Supposing these "skip lists" are indeed useful, how would you generate them? Well, here is some code that mostly works:

using System; 
using System.Collections.Generic; 
using System.IO; 
using System.Text; 
using System.Text.RegularExpressions; 
using NUnit.Framework; 

public class DtSearchUtil 
{ 
    /// <summary> 
    /// Makes a "skip list" for the dtSearch result document with the specified 
    /// WordArray data. The skip list indicates which hits in the dtSearch markup 
    /// should be skipped during hit navigation. The reason to skip some hits 
    /// is to allow navigation to be phrase aware, rather than forcing the user 
    /// to visit each word in the phrase as if it were an independent hit. 
    /// The skip list consists of 0-indexed hit numbers. 1, for example, would 
    /// mean that the second hit should be skipped during hit navigation. 
    /// </summary> 
    /// <param name="dtsHitsByWordArray">dtSearch HitsByWord data. You'll get this from SearchResultItem.HitsByWord 
    /// if you did your search with the dtsSearchWantHitsByWord and dtsSearchWantHitsArray 
    /// SearchFlags.</param> 
    /// <param name="userHitCount">How many total hits there are, if phrases are counted 
    /// as one hit each.</param> 
    /// <returns></returns> 
    public static List<int> MakeHitSkipList(string[] dtsHitsByWordArray, out int userHitCount) 
    { 
        List<int> skipList = new List<int>();
        userHitCount = 0;

        int curHitNum = 0; // like the dtSearch doc-level highlights, this counts hits word-by-word, rather than phrase by phrase
        List<PhraseRecord> hitRecords = new List<PhraseRecord>();
        foreach (string dtsHitsByWordString in dtsHitsByWordArray)
        {
            hitRecords.Add(PhraseRecord.ParseHitsByWordString(dtsHitsByWordString));
        }
        int prevEndOffset = -1;

        while (true)
        {
            // find the earliest unconsumed match offset across all records
            int nextOffset = int.MaxValue;
            foreach (PhraseRecord rec in hitRecords)
            {
                if (rec.CurOffset >= rec.OffsetList.Count)
                    continue;

                nextOffset = Math.Min(nextOffset, rec.OffsetList[rec.CurOffset]);
            }
            if (nextOffset == int.MaxValue)
                break;

            userHitCount++;

            // among the records matching at that offset, prefer the longest phrase
            PhraseRecord longestMatch = null;
            for (int i = 0; i < hitRecords.Count; i++)
            {
                PhraseRecord rec = hitRecords[i];
                if (rec.CurOffset >= rec.OffsetList.Count)
                    continue;
                if (nextOffset == rec.OffsetList[rec.CurOffset])
                {
                    if (longestMatch == null ||
                        longestMatch.LengthInWords < rec.LengthInWords)
                    {
                        longestMatch = rec;
                    }
                }
            }

            // skip subsequent words in the phrase
            for (int i = 1; i < longestMatch.LengthInWords; i++)
            {
                skipList.Add(curHitNum + i);
            }

            prevEndOffset = longestMatch.OffsetList[longestMatch.CurOffset] +
                (longestMatch.LengthInWords - 1);

            longestMatch.CurOffset++;

            curHitNum += longestMatch.LengthInWords;

            // skip over any unneeded, overlapping matches (i.e. at the same offset)
            for (int i = 0; i < hitRecords.Count; i++)
            {
                while (hitRecords[i].CurOffset < hitRecords[i].OffsetList.Count &&
                    hitRecords[i].OffsetList[hitRecords[i].CurOffset] <= prevEndOffset)
                {
                    hitRecords[i].CurOffset++;
                }
            }
        }

        return skipList;
    } 

    // Parsed form of the phrase-aware hit offset stuff that dtSearch can give you 
    private class PhraseRecord 
    { 
        public string PhraseText;

        /// <summary>
        /// Offsets into the source text at which this phrase matches. For example,
        /// offset 300 would mean that one of the places the phrase matches is
        /// starting at the 300th word in the document. (Words are counted according
        /// to dtSearch's internal word breaking algorithm.)
        /// See also:
        /// http://support.dtsearch.com/webhelp/dtSearchNetApi2/frames.html?frmname=topic&frmfile=dtSearch__Engine__SearchFlags.html
        /// </summary>
        public List<int> OffsetList;

        // BUG: We calculate this with a whitespace tokenizer. This will probably
        // cause bad results in some places. (Better to figure out how to count
        // the way dtSearch would.)
        public int LengthInWords
        {
            get
            {
                return Regex.Matches(PhraseText, @"[^\s]+").Count;
            }
        }

        // Index into OffsetList of the next match that has not been consumed yet
        public int CurOffset = 0;

        public static PhraseRecord ParseHitsByWordString(string dtsHitsByWordString)
        {
            // Expected shape: "<phrase>, <count> (<offset> <offset> ...)[, <field>]"
            Match m = Regex.Match(dtsHitsByWordString, @"^([^,]*),\s*\d*\s*\(([^)]*)\).*");
            if (!m.Success)
                throw new ArgumentException("Bad dtsHitsByWordString. Did you forget to set the dtsSearchWantHitsByWord flag in dtSearch?");

            string phraseText = m.Groups[1].Value;
            string parenStuff = m.Groups[2].Value;

            PhraseRecord hitRecord = new PhraseRecord();
            hitRecord.PhraseText = phraseText;
            hitRecord.OffsetList = GetMatchOffsetsFromParenGroupString(parenStuff);
            return hitRecord;
        }

        static List<int> GetMatchOffsetsFromParenGroupString(string parenGroupString)
        {
            List<int> res = new List<int>();
            MatchCollection matchCollection = Regex.Matches(parenGroupString, @"\d+");
            foreach (Match match in matchCollection)
            {
                string digitString = match.Groups[0].Value;
                res.Add(int.Parse(digitString));
            }
            return res;
        }
    } 
} 


[TestFixture] 
public class DtSearchUtilTests 
{ 
    [Test] 
    public void TestMultiPhrasesWithoutFieldName() 
    { 
        string[] foo = { @"apple pie, 7 (482 499 552 578 589 683 706);",
            @"bana*, 4 (490 505 689 713)"
        };

        // expected dtSearch hit order:
        // 0: apple@482
        // 1: pie@483 [should skip]
        // 2: bana*@490
        // 3: apple@499
        // 4: pie@500 [should skip]
        // 5: bana*@505
        // 6: apple@552
        // 7: pie@553 [should skip]
        // 8: apple@578
        // 9: pie@579 [should skip]
        // 10: apple@589
        // 11: pie@590 [should skip]
        // 12: apple@683
        // 13: pie@684 [skip]
        // 14: bana*@689
        // 15: apple@706
        // 16: pie@707 [skip]
        // 17: bana*@713

        int userHitCount;
        List<int> skipList = DtSearchUtil.MakeHitSkipList(foo, out userHitCount);

        Assert.AreEqual(11, userHitCount);

        Assert.AreEqual(1, skipList[0]);
        Assert.AreEqual(4, skipList[1]);
        Assert.AreEqual(7, skipList[2]);
        Assert.AreEqual(9, skipList[3]);
        Assert.AreEqual(11, skipList[4]);
        Assert.AreEqual(13, skipList[5]);
        Assert.AreEqual(16, skipList[6]);
        Assert.AreEqual(7, skipList.Count);
    } 

    [Test] 
    public void TestPhraseOverlap1()
    {
        string[] foo = { @"apple pie, 7 (482 499 552);",
            @"apple, 4 (482 490 499 552)"
        };

        // expected dtSearch hit order:
        // 0: apple@482
        // 1: pie@483 [should skip]
        // 2: apple@490
        // 3: apple@499
        // 4: pie@500 [should skip]
        // 5: apple@552
        // 6: pie@553 [should skip]

        int userHitCount;
        List<int> skipList = DtSearchUtil.MakeHitSkipList(foo, out userHitCount);

        Assert.AreEqual(4, userHitCount);

        Assert.AreEqual(1, skipList[0]);
        Assert.AreEqual(4, skipList[1]);
        Assert.AreEqual(6, skipList[2]);
        Assert.AreEqual(3, skipList.Count);
    } 

    [Test] 
    public void TestPhraseOverlap2()
    {
        string[] foo = { @"apple pie, 7 (482 499 552);",
            @"pie, 4 (483 490 500 553)"
        };

        // expected dtSearch hit order:
        // 0: apple@482
        // 1: pie@483 [should skip]
        // 2: pie@490
        // 3: apple@499
        // 4: pie@500 [should skip]
        // 5: apple@552
        // 6: pie@553 [should skip]

        int userHitCount;
        List<int> skipList = DtSearchUtil.MakeHitSkipList(foo, out userHitCount);

        Assert.AreEqual(4, userHitCount);

        Assert.AreEqual(1, skipList[0]);
        Assert.AreEqual(4, skipList[1]);
        Assert.AreEqual(6, skipList[2]);
        Assert.AreEqual(3, skipList.Count);
    } 

    // TODO: test "apple pie" and "apple", plus "apple pie" and "pie" 

    // "subject" should not freak it out 
    [Test] 
    public void TestSinglePhraseWithFieldName() 
    { 
        string[] foo = { @"apple pie, 7 (482 499 552 578 589 683 706), subject" };

        int userHitCount;
        List<int> skipList = DtSearchUtil.MakeHitSkipList(foo, out userHitCount);

        Assert.AreEqual(7, userHitCount);

        Assert.AreEqual(7, skipList.Count);
        Assert.AreEqual(1, skipList[0]);
        Assert.AreEqual(3, skipList[1]);
        Assert.AreEqual(5, skipList[2]);
        Assert.AreEqual(7, skipList[3]);
        Assert.AreEqual(9, skipList[4]);
        Assert.AreEqual(11, skipList[5]);
        Assert.AreEqual(13, skipList[6]);
    } 
} 