2014-11-24 19 views

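Get matches, including misspellings, for a given set of words

I have a collection of words stored in a List object; say it is this collection of titles: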

Lorem Ipsum 
Centuries 
Electronic 

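And this is the sample paragraph in which I want to look for these words:

lorem ipsum is simply dummy text of the printing and typesetting industry. Loren Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing LorenIpsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of LoremIpsum.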

My goal is to extract those words from the paragraph even when they are misspelled, so spelling errors should not matter.

My expected result here is:

lorem ipsum 
Loren Ipsum 
centuries 
electornic 
LorenIpsum 
LoremIpsum 

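But it is not limited to these, because this will run over whole articles, hundreds of them.

Sorry, I don't have any code written yet, but I plan to use RegEx for C# here.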

What do you mean by "not limited to these"? – hwnd 2014-11-24 04:33:17

Recommended reading: http://norvig.com/spell-correct.htm – Blorgbeard 2014-11-24 04:35:20

Implementing spell checking may be a bit too broad, but you can start here: http://stackoverflow.com/questions/2344320/comparing-strings-with-tolerance – 2014-11-24 04:35:46

Answer

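Many algorithms for checking the similarity between two words are available on the internet. The edit distance (Levenshtein distance) computed by GetEdits below is one of them.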
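For reference, the recurrence that this kind of edit-distance computation fills in can be sketched as follows. The symbols are notation assumed here, not taken from the post: $a$ and $b$ are the two lower-cased strings, and $d_{i,j}$ is the number of edits needed to turn the first $i$ characters of $a$ into the first $j$ characters of $b$.

```latex
\begin{aligned}
d_{i,0} &= i, \qquad d_{0,j} = j, \\
d_{i,j} &=
\begin{cases}
d_{i-1,j-1} & \text{if } a_i = b_j, \\
1 + \min\bigl(d_{i-1,j},\; d_{i,j-1},\; d_{i-1,j-1}\bigr) & \text{otherwise.}
\end{cases}
\end{aligned}
```

The three terms inside the minimum correspond to a deletion, an insertion and a substitution, and the distance between the full strings is $d_{|a|,|b|}$.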

The following code can be used, though it may not be very efficient.

using System;
using System.Linq; // lets findWord.Contains(' ') compile on older frameworks

static class Program
{
    // Levenshtein edit distance between two strings, ignoring case.
    static int GetEdits(string answer, string guess)
    {
        guess = guess.ToLowerInvariant();
        answer = answer.ToLowerInvariant();

        // d[i, j] = edits needed to turn the first i chars of answer
        // into the first j chars of guess.
        int[,] d = new int[answer.Length + 1, guess.Length + 1];
        for (int i = 0; i <= answer.Length; i++)
            d[i, 0] = i;
        for (int j = 0; j <= guess.Length; j++)
            d[0, j] = j;

        for (int j = 1; j <= guess.Length; j++)
        {
            for (int i = 1; i <= answer.Length; i++)
            {
                if (answer[i - 1] == guess[j - 1])
                    d[i, j] = d[i - 1, j - 1]; // no operation
                else
                    d[i, j] = Math.Min(Math.Min(
                        d[i - 1, j] + 1,       // a deletion
                        d[i, j - 1] + 1),      // an insertion
                        d[i - 1, j - 1] + 1);  // a substitution
            }
        }
        return d[answer.Length, guess.Length];
    }

    static void Main(string[] args)
    {
        const string text = @"lorem ipsum is simply dummy text of the printing and typesetting industry. Loren Ipsum has been the industrys standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing LorenIpsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of LoremIpsum.";

        var findWords = new[]
        {
            "Lorem Ipsum",
            "Centuries",
            "Electronic"
        };

        // Maximum edit distance at which a token still counts as a match.
        const int MaxErrors = 2;

        // Tokenize the text on spaces, commas and periods (this leaves some empty tokens).
        var tokens = text.Split(' ', ',', '.');

        for (int i = 0; i < tokens.Length; i++)
        {
            if (tokens[i].Length == 0)
                continue;

            foreach (var findWord in findWords)
            {
                if (GetEdits(findWord, tokens[i]) <= MaxErrors)
                {
                    Console.WriteLine(tokens[i]);
                    break;
                }

                // For a multi-word search term, join with the next token and check again.
                if (findWord.Contains(' ') && i + 1 < tokens.Length)
                {
                    string token = tokens[i] + " " + tokens[i + 1];
                    if (GetEdits(findWord, token) <= MaxErrors)
                    {
                        Console.WriteLine(token);
                        i++; // the next token has been consumed
                        break;
                    }
                }
            }
        }
    }
}
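
Since the code above is described as possibly not very efficient, and the question mentions running over hundreds of articles, here is a minimal sketch of one way the distance check could be made cheaper. It is not part of the original answer: the names FuzzyMatch and BoundedEdits are hypothetical, and the approach (keeping only two rows of the table and giving up as soon as the error limit cannot be met) is just one common variation, shown under the assumption that only "within MaxErrors or not" matters.

```csharp
using System;

static class FuzzyMatch // hypothetical helper, not from the original answer
{
    // Returns the case-insensitive Levenshtein distance between a and b
    // when it is at most maxErrors; otherwise returns maxErrors + 1.
    public static int BoundedEdits(string a, string b, int maxErrors)
    {
        a = a.ToLowerInvariant();
        b = b.ToLowerInvariant();

        // The distance is never smaller than the difference in length.
        if (Math.Abs(a.Length - b.Length) > maxErrors)
            return maxErrors + 1;

        // Only the previous and the current row of the table are kept.
        int[] prev = new int[b.Length + 1];
        int[] curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++)
            prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(prev[j] + 1,   // a deletion
                             curr[j - 1] + 1), // an insertion
                    prev[j - 1] + cost);    // a match or a substitution
                rowMin = Math.Min(rowMin, curr[j]);
            }

            // Later rows can never go below the minimum of this row,
            // so the limit is already exceeded.
            if (rowMin > maxErrors)
                return maxErrors + 1;

            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }

        return Math.Min(prev[b.Length], maxErrors + 1);
    }
}
```

With such a helper, a check like GetEdits(findWord, token) <= MaxErrors could be written as FuzzyMatch.BoundedEdits(findWord, token, MaxErrors) <= MaxErrors. Most tokens in an article differ greatly from the search terms, so the length check and the early exit would reject them after little work.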