2014-02-06 107 views
1

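Fuzzy matching multiple words in a string

I am trying to use Levenshtein distance to find fuzzy keywords (static text) on an OCR'd page.
To do this, I want to specify an allowed error percentage (say, 15%).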

string Keyword = "past due electric service"; 

Since the keyword is 25 characters long, I want to allow 4 errors (25 * 0.15 = 3.75, rounded to 4).
I need to be able to compare it against...

string Entire_OCR_Page = "previous bill amount payment received on 12/26/13 thank " +
          "you! current electric service total balances unpaid 7 " +
          "days after the total due date are subject to a late " +
          "charge of 7.5% of the amount due or $2.00, whichever/5 " +
          "greater. ";

This is how I am doing it right now...

int LevenshteinDistance = LevenshteinAlgorithm(Keyword, Entire_OCR_Page); // = 202 
int NumberOfErrorsAllowed = 4; 
int Allowance = (Entire_OCR_Page.Length - Keyword.Length) + NumberOfErrorsAllowed; // = 205 

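Obviously, Keyword is not found in the OCR text (and it should not be). However, measured by Levenshtein distance against the whole page, the number of errors falls within the 15% allowance (202 <= 205), so my logic reports it as found.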
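Why the whole-page comparison passes follows from a lower bound: the Levenshtein distance between two strings is never smaller than the difference of their lengths, and it equals that difference whenever the shorter string's characters appear in order (not necessarily adjacent) inside the longer one. The sketch below only illustrates this; the class `WholePageCheck`, the helper `ErrorsAllowed`, and the artificially scattered page are hypothetical and not part of the original post.

```csharp
using System;
using System.Text;

static class WholePageCheck
{
    // Plain Levenshtein distance using two rolling rows; included so the sketch is self-contained.
    public static int Levenshtein(string s, string t)
    {
        int[] prev = new int[t.Length + 1];
        int[] curr = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) prev[j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // the row just filled becomes the previous row
        }
        return prev[t.Length];
    }

    // Errors allowed for a keyword at a given percentage, rounding midpoints away from zero.
    public static int ErrorsAllowed(string keyword, decimal percent)
    {
        return (int)Math.Round(keyword.Length * percent / 100m, MidpointRounding.AwayFromZero);
    }

    public static void Main()
    {
        string keyword = "past due electric service";
        int allowed = ErrorsAllowed(keyword, 15m);
        Console.WriteLine(allowed); // 25 * 0.15 = 3.75, which rounds to 4

        // Hypothetical page: every keyword character is followed by two junk characters,
        // so the phrase never appears, yet its characters still occur in order.
        var scattered = new StringBuilder();
        foreach (char c in keyword) scattered.Append(c).Append("##");
        string page = scattered.ToString();

        int distance = Levenshtein(keyword, page);
        int allowance = (page.Length - keyword.Length) + allowed;
        Console.WriteLine(distance);              // 50, exactly the length difference
        Console.WriteLine(distance <= allowance); // True, although the phrase is absent
    }
}
```

Because the allowance already contains the full length difference, the check effectively asks whether the keyword is close to a scattered subsequence of the page, not whether the phrase occurs in it.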

Does anyone know a better way to do this?

+0

Posted a better question: http://goo.gl/Rb6ejp – Milne

Answers

1

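I answered my own question by using substrings. Posting it here in case anyone else runs into the same kind of problem. It is a bit unorthodox, but it works well for me.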

decimal PossibleStringLength = PossibleString.Length; // length of the string to search in
decimal StaticTextLength = StaticText.Length; // length of the text to search for
// Number of errors allowed for the given ErrorAllowance percentage (100m forces decimal division)
decimal NumberOfErrorsAllowed = Math.Round(StaticTextLength * (ErrorAllowance / 100m), MidpointRounding.AwayFromZero);
int TextLengthBuffer = (int)StaticTextLength - 1; // start with a window one character shorter than the keyword
int LowestLevenshteinNumber = 999999; // initialize to an absurdly high maximum

    // Look for the best match with 1 character fewer than the keyword, then the exact length,
    // and last, with 1 character more. (This is because OCR can recognize one letter as
    // two (W -> VV) and vice versa.)

for (int i = 0; i < 3; i++)
{
    for (int e = TextLengthBuffer; e <= (int)PossibleStringLength; e++)
    {
        string possibleResult = PossibleString.Substring(e - TextLengthBuffer, TextLengthBuffer);
        int lAllowance = (int)Math.Round((possibleResult.Length - StaticTextLength) + NumberOfErrorsAllowed, MidpointRounding.AwayFromZero);
        int lNumber = LevenshteinAlgorithm(StaticText, possibleResult);

        if (lNumber <= lAllowance && ((lNumber < LowestLevenshteinNumber) || (TextLengthBuffer == StaticText.Length && lNumber <= LowestLevenshteinNumber)))
        {
            PossibleResult = new StaticTextResult { text = possibleResult, errors = lNumber };
            LowestLevenshteinNumber = lNumber;
        }
    }
    TextLengthBuffer++;
}




public static int LevenshteinAlgorithm(string s, string t) // Levenshtein Algorithm 
{ 
    int n = s.Length; 
    int m = t.Length; 
    int[,] d = new int[n + 1, m + 1]; 

    if (n == 0) 
    { 
     return m; 
    } 

    if (m == 0) 
    { 
     return n; 
    } 

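    // First column: turning a prefix of s into the empty string costs i deletions.
    for (int i = 0; i <= n; i++)
    {
     d[i, 0] = i;
    }

    // First row: turning the empty string into a prefix of t costs j insertions.
    for (int j = 0; j <= m; j++)
    {
     d[0, j] = j;
    }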

    for (int i = 1; i <= n; i++) 
    { 
     for (int j = 1; j <= m; j++) 
     { 
      int cost = (t[j - 1] == s[i - 1]) ? 0 : 1; 

      d[i, j] = Math.Min(
       Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), 
       d[i - 1, j - 1] + cost); 
     } 
    } 
    return d[n, m]; 
} 
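
For anyone who wants to try the substring idea in isolation, the following is a minimal self-contained sketch of it, not the exact code above. It assumes a flat error allowance for all three window lengths (the code above adjusts the allowance by the length difference), and the names `FuzzyPhraseSearch`, `Match`, and `FindBestMatch` are hypothetical.

```csharp
using System;

static class FuzzyPhraseSearch
{
    public sealed class Match
    {
        public string Text;
        public int Errors;
    }

    // Slides windows of length keyword.Length - 1, keyword.Length and keyword.Length + 1
    // over the page and keeps the window with the fewest errors within the allowance.
    public static Match FindBestMatch(string keyword, string page, decimal errorPercent)
    {
        int allowed = (int)Math.Round(keyword.Length * errorPercent / 100m, MidpointRounding.AwayFromZero);
        Match best = null;

        for (int window = keyword.Length - 1; window <= keyword.Length + 1; window++)
        {
            if (window < 1 || window > page.Length) continue;

            for (int start = 0; start + window <= page.Length; start++)
            {
                string candidate = page.Substring(start, window);
                int errors = Levenshtein(keyword, candidate);
                if (errors > allowed) continue;

                // Prefer fewer errors; on a tie, prefer a window of exactly the keyword's length.
                bool better = best == null
                    || errors < best.Errors
                    || (window == keyword.Length && errors == best.Errors);
                if (better) best = new Match { Text = candidate, Errors = errors };
            }
        }
        return best; // null when no window is within the allowance
    }

    // Plain Levenshtein distance using two rolling rows.
    public static int Levenshtein(string s, string t)
    {
        int[] prev = new int[t.Length + 1];
        int[] curr = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) prev[j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[t.Length];
    }

    public static void Main()
    {
        // Hypothetical OCR line with two misread characters ('5' for 's', 'l' for 'i').
        Match m = FindBestMatch("past due electric service",
                                "your account is pa5t due electrlc service will be", 15m);
        Console.WriteLine(m == null ? "not found" : m.Text + " (" + m.Errors + " errors)");
    }
}
```

Every window costs one full Levenshtein computation, so the work grows roughly with page length times the square of the keyword length. That is usually fine for a single page and a short keyword, but worth measuring before running it over many keywords.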
0

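I think it is not working because large chunks of your strings match. So what I would do is try splitting your keyword into separate words.

Then find all the places in your OCR_TEXT where those words match.

Then look at all the places where they matched, and see whether 4 of those places are consecutive and match the original phrase.

I am not sure whether my explanation is clear?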
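
The per-word idea described in this answer could look roughly like the sketch below. It is one possible reading, not the answerer's code: it assumes whitespace tokenization, a separate error allowance per word computed from the same percentage, and hypothetical names (`PerWordSearch`, `FindPhrase`).

```csharp
using System;

static class PerWordSearch
{
    // Returns the index of the first page token at which every keyword word matches
    // the corresponding consecutive token within its own allowance, or -1 if none does.
    public static int FindPhrase(string keyword, string page, decimal errorPercent)
    {
        char[] separators = { ' ', '\t', '\r', '\n' };
        string[] words = keyword.Split(separators, StringSplitOptions.RemoveEmptyEntries);
        string[] tokens = page.Split(separators, StringSplitOptions.RemoveEmptyEntries);
        if (words.Length == 0) return -1;

        for (int start = 0; start + words.Length <= tokens.Length; start++)
        {
            bool allMatch = true;
            for (int w = 0; w < words.Length && allMatch; w++)
            {
                // Assumption: each word gets its own allowance from the same percentage.
                int allowed = (int)Math.Round(words[w].Length * errorPercent / 100m, MidpointRounding.AwayFromZero);
                allMatch = Levenshtein(words[w], tokens[start + w]) <= allowed;
            }
            if (allMatch) return start;
        }
        return -1;
    }

    // Plain Levenshtein distance using two rolling rows.
    public static int Levenshtein(string s, string t)
    {
        int[] prev = new int[t.Length + 1];
        int[] curr = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) prev[j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[t.Length];
    }

    public static void Main()
    {
        // Hypothetical OCR line; the phrase starts at token index 1.
        Console.WriteLine(FindPhrase("past due electric service",
                                     "your pa5t due electrlc service now", 15m)); // 1
    }
}
```

With a 15% allowance a three-letter word such as "due" rounds to zero allowed errors, so a single misread character in a short word makes the whole phrase fail. It also fails when OCR merges or splits words, because the tokens no longer line up one to one.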

+0

If I understand your answer correctly, I would lose the ability to declare NumberOfErrorsAllowed. No? – Milne

+0

Yes and no; it would be per word. –

+0

Per word does not work. A word could be "I", and if it were recognized as "1", I would lose the result. See the answer I came up with. Thanks – Milne