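Below is a function that uses TessNet2 (an OCR framework) to scan through the list of words captured by TessNet2's built-in OCR. Because the pages I scan are of less-than-perfect quality, the detected words are not 100% accurate. The OCR word-recognition logic therefore sometimes confuses 'S' with '5', or 'l' with '1'. It also does not account for capitalization, so I have to look for both cases.
The way it works is that I search the page for certain words that appear next to each other. The first set of words [i] is "Abstracting Service Ordered". If the page contains those words adjacent to one another, it moves on to the next set of words [j], and then to the next [h]. If the page contains all 3 sets of words, it returns true.
This is the best approach I have come up with, but I am hoping someone here can suggest another way to try it.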
// Requires: using System; using System.Collections.Generic;
public Boolean isPageABSTRACTING(List<tessnet2.Word> wordList)
{
    // Each tier lists, per word position, the spellings to accept (including known OCR misreads).
    string[][] tier1 = { new[] { "abstracting", "abstractmg" }, new[] { "service", "5ervice" }, new[] { "ordered" } };
    string[][] tier2 = { new[] { "due", "oue" }, new[] { "date", "oate" }, new[] { "&" } };
    string[][] tier3 = { new[] { "additional" }, new[] { "comments" }, new[] { "about" }, new[] { "this" } };

    // The page matches only if all three phrases occur somewhere on it.
    return ContainsPhrase(wordList, tier1) && ContainsPhrase(wordList, tier2) && ContainsPhrase(wordList, tier3);
}

// Returns true if some run of adjacent words matches the phrase, each word with Confidence >= 50.
private static bool ContainsPhrase(List<tessnet2.Word> wordList, string[][] phrase)
{
    // Stop early enough that start + k never indexes past the end of the list.
    for (int start = 0; start + phrase.Length <= wordList.Count; start++)
    {
        bool matched = true;
        for (int k = 0; k < phrase.Length && matched; k++)
        {
            tessnet2.Word word = wordList[start + k];
            // The confidence check now applies to every spelling, not only the last alternative;
            // a case-insensitive comparison replaces the duplicated upper/lower-case literals.
            matched = word.Confidence >= 50 &&
                      Array.Exists(phrase[k], v => string.Equals(v, word.Text, StringComparison.OrdinalIgnoreCase));
        }
        if (matched)
        {
            return true;
        }
    }
    return false;
}
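As one other way to try this, the hand-listed misreadings ("abstractmg", "5ervice", "Oue") could be replaced by a tolerance on edit distance, so that a word counts as a match when it is within a small number of character edits of the expected word. The following is only a sketch under stated assumptions: `OcrWord` is a hypothetical stand-in for `tessnet2.Word` (only `Text` and `Confidence` are used), `FuzzyPhraseMatcher` and its methods are illustrative names, the tolerance rule in `IsCloseTo` is a guess that would need tuning on real scans, and the threshold of 50 is simply carried over from the code above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for tessnet2.Word so the sketch compiles on its own.
public sealed class OcrWord
{
    public string Text { get; }
    public double Confidence { get; }
    public OcrWord(string text, double confidence) { Text = text; Confidence = confidence; }
}

public static class FuzzyPhraseMatcher
{
    // Levenshtein distance via dynamic programming (insert, delete, substitute each cost 1).
    public static int EditDistance(string a, string b)
    {
        int[] prev = new int[b.Length + 1];
        int[] curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;
        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two row buffers
        }
        return prev[b.Length];
    }

    // Assumed tolerance: exact match for tokens shorter than 3 characters,
    // otherwise about one edit per five characters (at least one).
    public static bool IsCloseTo(string ocrText, string expected)
    {
        int allowed = expected.Length < 3 ? 0 : Math.Max(1, expected.Length / 5);
        return EditDistance(ocrText.ToLowerInvariant(), expected.ToLowerInvariant()) <= allowed;
    }

    // True if the expected words appear next to each other, each with Confidence >= minConfidence.
    public static bool ContainsPhrase(IReadOnlyList<OcrWord> words, string[] expected, double minConfidence = 50)
    {
        for (int start = 0; start + expected.Length <= words.Count; start++)
        {
            bool matched = true;
            for (int k = 0; k < expected.Length && matched; k++)
            {
                OcrWord w = words[start + k];
                matched = w.Confidence >= minConfidence && IsCloseTo(w.Text, expected[k]);
            }
            if (matched) return true;
        }
        return false;
    }
}
```

A page check would then call `ContainsPhrase` once per phrase, for example with `new[] { "abstracting", "service", "ordered" }`, and require all three calls to succeed. The trade-off is that short words such as "Due" tolerate one edit out of three characters, so unrelated words like "Dye" would also pass; requiring the whole adjacent phrase to match is what keeps false positives down.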
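Would you mind explaining the code? When I run this, 'text' becomes a string of all the text in the OCR result, but several of the words are "DONTMATCH". I assume that is because their confidence was not greater than 50. – MaylorTaylor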
Tried to add more explanation in... –