Train Tesseract for the Hindi language

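I want to train Tesseract for Hindi. I have many images of Hindi text written in a specific font, and I want to train Tesseract OCR on these images. I have tried several times to train Tesseract by following https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3. When I run the makebox command, it produces a box file, but the characters in it are recognized as English characters. I don't understand why this happens. Please help me train Tesseract OCR for Hindi. You can view a sample image at the following link: sample file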

Source

2014-12-20 Prasad Bhosale

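I have been wanting to train a few character sets myself and have been collecting information. Maybe this information is useful to you too.

Have you read this document:

http://blog.cedric.ws/how-to-train-tesseract-301

If the characters are not recognized, you will have to train all of them, I'm afraid. But the important steps seem to be:

Include the language indication ("eng") on the makebox command line (this would probably be "hin" in your case).
Keep the Tesseract version in mind. My impression is that the training process has changed between versions.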

Source

2014-12-20 16:00:49 jcoppens

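A sample program that recognizes Hindi characters in an image and stores each bounding box's values, together with the corresponding Hindi character, in one file.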

/* 
* Char_OCR.cpp 
* 
* Created on: Jun 23, 2016 
*  Author: pratik 
*/ 

#include <opencv2/opencv.hpp> 
#include <tesseract/baseapi.h> 
#include <leptonica/allheaders.h> 
#include <iostream> 
#include <fstream> 
#include <cstdio>  // fprintf 
#include <cstdlib> // exit 
#include <cstring> // strlen 

using namespace std; 
using namespace cv; 

void dumpIntoFile(const char *ocrResult , ofstream &myfile1 ,int x1, int y1, 
     int x2, int y2, int &); 

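int main(int argc ,char **argv) 
{ 
    // An image path is required as the first argument 
    if (argc < 2) { 
     cerr << "Usage: " << argv[0] << " <image file>\n"; 
     return 1; 
    } 

    Pix *image = pixRead(argv[1]); 

    if (image == 0) { 
     cout << "Cannot load input file!\n"; 
     return 1; // do not continue with a null image 
    } 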

    tesseract::TessBaseAPI tess; 


    if (tess.Init("/usr/share/tesseract/tessdata", "hin")) { 
      fprintf(stderr, "Could not initialize tesseract.\n"); 
      exit(1); 
     } 

    tess.SetImage(image); 
    tess.Recognize(0); 

    tesseract::ResultIterator *ri = tess.GetIterator(); 
    tesseract::PageIteratorLevel level = tesseract::RIL_SYMBOL; 

    cout << ri << endl; 

    ofstream myfile1("Word.txt"); 

    // BoundingBox() yields left, top, right, bottom, so label the columns accordingly 
    myfile1 << "ID" << '\t' << "CORD_X1" << '\t' << "CORD_Y1" << '\t' << 
      "CORD_X2" << '\t' << "CORD_Y2" << '\t' << "STRING" << endl; 

    int i=1; 

    if(ri!=0) 
    { 
     do { 
      const char *word = ri->GetUTF8Text(level); 
      if (word == 0) continue; // no text for this symbol; move on to the next one 
//   cout << word << endl; 

      //float conf = ri->Confidence(level); 
      int x1, y1, x2, y2; 
      ri->BoundingBox(level, &x1, &y1, &x2, &y2); 

      dumpIntoFile(word, myfile1, x1, y1, x2, y2, i); 

      delete []word; 

     } while (ri->Next(level)); 

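     delete ri; // the iterator is a single object, not an array 
    } 

    // Release Tesseract and Leptonica resources 
    tess.End(); 
    pixDestroy(&image); 
    return 0; 
} 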

void dumpIntoFile(const char *ocrResult , ofstream &myfile1 ,int x1, int y1, 
     int x2, int y2,int &i) 
{ 

      int length = strlen(ocrResult); 

       myfile1 << i++ << '\t' << x1 << '\t' << y1 << '\t' << 
         x2 << '\t' << y2 << '\t' ; 

       //cout << "in the string (" << length << ") ::"; 
       for(int j = 0; j < length && ocrResult[j] != '\n'; j++) 
       { 
        myfile1 << ocrResult[j]; 
       } 

       myfile1 << endl; 

}
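The program above writes coordinates as returned by BoundingBox(), which are measured from the top-left corner of the image. Tesseract box files used for training list `symbol left bottom right top page`, with y measured from the bottom of the image. The following is a minimal sketch of that conversion; `toBoxLine` is a hypothetical helper, not part of the Tesseract API, and the image height is assumed to be supplied by the caller.

```cpp
#include <sstream>
#include <string>

// Format one box-file line: "<symbol> <left> <bottom> <right> <top> <page>".
// x1, y1, x2, y2 are the left/top/right/bottom values from BoundingBox()
// (origin at the top-left); box files measure y from the bottom of the image,
// so the y values are flipped using the image height.
std::string toBoxLine(const std::string &symbol, int x1, int y1, int x2, int y2,
                      int imageHeight, int page = 0) {
    std::ostringstream out;
    out << symbol << ' '
        << x1 << ' ' << (imageHeight - y2) << ' '
        << x2 << ' ' << (imageHeight - y1) << ' '
        << page;
    return out.str();
}
```

In the program above, the height could be taken from `pixGetHeight(image)`, and `toBoxLine(word, x1, y1, x2, y2, height)` could be written to the output file in place of the tab-separated row.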

Source

2016-06-23 13:39:55

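If you want better accuracy than this, you can give pixRead() an Otsu-thresholded image. In the program above I am passing a normal image to pixRead(); pass an Otsu-thresholded image instead. I developed an algorithm for this. Let me know if anyone wants it.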
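For reference, here is a minimal sketch of standard Otsu thresholding. It is not the commenter's own algorithm, which was not posted. It assumes the image is already available as a flat buffer of 8-bit grayscale pixels; loading the image and saving the binarized result are left out, and the function names are illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Otsu's method: choose the threshold that maximizes the between-class
// variance of the background (<= t) and foreground (> t) pixels.
int otsuThreshold(const std::vector<std::uint8_t> &gray) {
    if (gray.empty()) return 0;

    double hist[256] = {0};
    for (std::uint8_t v : gray) hist[v] += 1.0;

    const double total = static_cast<double>(gray.size());
    double sumAll = 0.0;
    for (int t = 0; t < 256; ++t) sumAll += t * hist[t];

    double weightBg = 0.0, sumBg = 0.0, bestVar = -1.0;
    int best = 0;
    for (int t = 0; t < 256; ++t) {
        weightBg += hist[t];
        if (weightBg == 0.0) continue;        // no background pixels yet
        const double weightFg = total - weightBg;
        if (weightFg == 0.0) break;           // no foreground pixels left
        sumBg += t * hist[t];
        const double meanBg = sumBg / weightBg;
        const double meanFg = (sumAll - sumBg) / weightFg;
        const double diff = meanBg - meanFg;
        const double var = weightBg * weightFg * diff * diff;
        if (var > bestVar) { bestVar = var; best = t; }
    }
    return best;
}

// Map pixels above the threshold to white (255) and the rest to black (0).
std::vector<std::uint8_t> binarize(const std::vector<std::uint8_t> &gray,
                                   int threshold) {
    std::vector<std::uint8_t> out(gray.size());
    for (std::size_t i = 0; i < gray.size(); ++i)
        out[i] = gray[i] > threshold ? 255 : 0;
    return out;
}
```

Since pixRead() takes a file name, the binarized buffer would either have to be written to an image file first or handed to Tesseract directly, for example through the raw-buffer overload of TessBaseAPI::SetImage.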
