Equation OCR Tutorial Part 2: Training characters with Tesseract OCR

This series uses OpenCV and Tesseract to take a scanned image of an equation, read it in, graph it, and report related data. I was surprised at how well the results turned out =)

I will be using OpenCV 2.4.2 and Tesseract OCR 3.02.02.

I have also written two tutorials on installing Tesseract and OpenCV for Vista x86 with Microsoft Visual Studio 2008 Express. Alternatively, the official sites have documentation on installing the libraries on your system.

Parts

Equation OCR Part 1: Using contours to extract characters in OpenCV
Equation OCR Part 2: Training characters with Tesseract OCR
Equation OCR Part 3: Equation OCR

Tutorials

Installing OpenCV: http://blog.ayoungprogrammer.com/2012/10/tutorial-install-opencv-242-for-windows.html/

Installing Tesseract: http://blog.ayoungprogrammer.com/2012/11/tutorial-installing-tesseract-ocr-30202.html/

Official Links:

OpenCV : http://opencv.org/
Tesseract OCR: http://code.google.com/p/tesseract-ocr/

Overview:

The overall goal of the final program is to convert the image of an equation into a text equation that we can graph. We can break this project down into three parts: extracting characters from the image, training the OCR, and recognition, which converts images of equations into text.

Training

We will split the training process into two parts: classifying and Tesseracting. In classifying, we will use the extraction method from part one to build a program that generates training data for Tesseract. It extracts characters and has a user identify each character to be classified. The characters go into folders labelled with the character name. For example, all the 9’s go in the “9” folder and all the x’s go in the “x” folder. In the Tesseracting part, we take our training data and run it through the Tesseract training process so that the data can be used for OCR.

Classifying

Classifying takes the longest because the training data needs about 10 samples of each character. Our characters are the digits 0 to 9, the left bracket, the right bracket, the plus sign and x. We will ignore dashes because they can easily be recognized as shapes whose width is more than three times their height. We can also ignore equal signs because they are just two dashes, one on top of the other. With some slight modifications to the extraction program in part 1, we can make a training program for this. The training data I collected required about 30 different images of equations.
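As a rough illustration of the dash and equal-sign shortcut, here is a minimal Python sketch. It assumes each character has already been reduced to a bounding box (x, y, width, height), as the extraction step in part 1 would give you. The function names, the strict "greater than" comparison and the horizontal-overlap test for equal signs are my own assumptions for illustration, not the code from the pastebin link.

```python
def is_dash(width, height, ratio=3.0):
    """Return True when a box is more than `ratio` times wider than it is tall."""
    if width <= 0 or height <= 0:
        return False
    return width > ratio * height


def is_equals_sign(box_a, box_b, ratio=3.0):
    """Return True when two dash-shaped boxes overlap horizontally.

    Each box is a tuple (x, y, width, height). The overlap rule is an
    assumption: two dashes stacked vertically share some range of x.
    """
    ax, _, aw, ah = box_a
    bx, _, bw, bh = box_b
    if not (is_dash(aw, ah, ratio) and is_dash(bw, bh, ratio)):
        return False
    overlap = min(ax + aw, bx + bw) - max(ax, bx)
    return overlap > 0
```

Boxes that pass these checks can be labelled directly and never need to be shown to Tesseract, which keeps the trained character set small.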

Classifying source:

http://pastebin.com/iJQsPh9L

Tesseracting

Tesseract's documentation explains the training method in a confusing way, and the method itself is tedious. The recommended approach is to supply sample images plus a separate data file that gives, for each character in the image, its symbol and bounding rectangle. As you can imagine, this gets tedious because you have to find the coordinates and dimensions of the rectangle for every character. There are a few online GUI tools that help with the process. I, on the other hand, am very lazy and did not want to go through a hundred rectangles, so I made a program that generates an image from the training data along with the corresponding rectangles. The final result is something like this:
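For reference, the rectangle data goes into a Tesseract box file with one line per character: the symbol, then the left, bottom, right and top edges of its rectangle, then the page number, with coordinates measured from the bottom-left corner of the image. The Python sketch below shows how such lines could be generated when characters are pasted into a single row. It is not the generator program mentioned above; the function name, the single-row layout and the padding value are assumptions made for illustration.

```python
def make_box_lines(glyphs, image_height, padding=10, page=0):
    """Lay glyphs out left to right and return Tesseract box-file lines.

    glyphs: list of (char, width, height) tuples in pixels.
    Glyphs are top-aligned at y = padding, measured from the top of the
    image (the usual image convention). Box files measure y from the
    bottom of the image, so the vertical coordinates are flipped.
    """
    lines = []
    x = padding
    for char, width, height in glyphs:
        left = x
        right = x + width
        top = image_height - padding
        bottom = image_height - (padding + height)
        lines.append(f"{char} {left} {bottom} {right} {top} {page}")
        x = right + padding  # leave a gap before the next glyph
    return lines


if __name__ == "__main__":
    # Hypothetical glyph sizes, just to show the output format.
    for line in make_box_lines([("9", 20, 30), ("x", 15, 15)], image_height=50):
        print(line)
```

Writing the returned lines to mat.arial.exp0.box, next to the matching mat.arial.exp0.tif, gives the pair of files the training command expects.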

Source code here:

Now that we have finished creating the training boxes, we can feed the results into the Tesseract engine so it can learn to recognize the characters. Open a command prompt and go to the folder that contains your .tif file and the file with the rectangle data. Type in tesseract and hit enter. If it says the command was not found, Tesseract is not installed properly (or is not on your PATH).

To start the training (mat stands for math):
tesseract mat.arial.exp0.tif mat.arial.exp0 nobatch box.train

Tesseract will now have generated a file called mat.arial.exp0.tr. Don’t touch the file. Next, we have to tell Tesseract which characters we are using. The character set file (unicharset) can be generated by running:
unicharset_extractor mat.arial.exp0.box

Create a new file called font_properties. Like unicharset, it has no file extension; I just copied the unicharset file and saved it under the new name font_properties. Do not use Notepad, as it will mess up the formatting. Use something like WordPad. Inside font_properties, type in:

arial 1 0 0 1 0
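Each line of font_properties is a font name followed by five 0/1 flags, in the order italic, bold, fixed, serif, fraktur, and the font name has to match the font part of the training file name (arial in mat.arial.exp0). If you would rather not depend on a text editor at all, the file can be written from a script. The helper names below are hypothetical and this is only a sketch of one way to do it.

```python
def font_properties_line(name, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Build one font_properties line: <name> <italic> <bold> <fixed> <serif> <fraktur>."""
    flags = (italic, bold, fixed, serif, fraktur)
    return name + " " + " ".join("1" if flag else "0" for flag in flags)


def write_font_properties(path, lines):
    """Write the lines as plain ASCII with Unix newlines and no byte-order mark."""
    with open(path, "w", encoding="ascii", newline="\n") as handle:
        for line in lines:
            handle.write(line + "\n")


if __name__ == "__main__":
    # Reproduces the line used in this tutorial (italic and serif flags set).
    write_font_properties(
        "font_properties",
        [font_properties_line("arial", italic=True, serif=True)],
    )
```

Note that the line used in this tutorial sets the italic and serif flags; you may want different flags for your own font.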

Next, run mftraining:

mftraining -F font_properties -U unicharset mat.arial.exp0.tr

Shape clustering:
shapeclustering -F font_properties -U unicharset mat.arial.exp0.tr

mftraining again for shapetables:
mftraining -F font_properties -U unicharset mat.arial.exp0.tr

cntraining for clustering:
cntraining mat.arial.exp0.tr

Now we have to combine all these files into one. First, rename the following files so that they carry the mat. prefix:

inttemp -> mat.inttemp
shapetable -> mat.shapetable
normproto -> mat.normproto
pffmtable -> mat.pffmtable
unicharset -> mat.unicharset
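Renaming five files by hand is easy to get wrong, so here is a small Python sketch that adds the prefix for you. The function name and the choice to skip missing files are assumptions for illustration; run it in the folder where the training tools wrote their output.

```python
from pathlib import Path

# The files produced by the training steps that need the language prefix.
TRAINING_FILES = ["inttemp", "shapetable", "normproto", "pffmtable", "unicharset"]


def add_lang_prefix(folder, lang="mat", names=TRAINING_FILES):
    """Rename e.g. inttemp to mat.inttemp inside `folder`.

    Missing files are skipped. Returns the new file names in the
    order they were renamed.
    """
    renamed = []
    for name in names:
        source = Path(folder) / name
        if not source.exists():
            continue
        target = source.with_name(f"{lang}.{name}")
        source.replace(target)  # overwrites a stale copy from an earlier run
        renamed.append(target.name)
    return renamed


if __name__ == "__main__":
    print(add_lang_prefix("."))
```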

To generate your new tess data file:
combine_tessdata mat.
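If you retrain often, the command sequence from this section can be collected into a short script. The sketch below only builds the command lines and prints them by default; the function names are hypothetical, and the cntraining arguments are an assumption based on the usual Tesseract 3.x pattern of passing it the .tr file. The rename step still has to happen between the training commands and combine_tessdata.

```python
import subprocess


def training_commands(lang="mat", font="arial", exp=0):
    """Return the training commands, in order, as argument lists."""
    base = f"{lang}.{font}.exp{exp}"
    tr_file = base + ".tr"
    common = ["-F", "font_properties", "-U", "unicharset", tr_file]
    return [
        ["tesseract", base + ".tif", base, "nobatch", "box.train"],
        ["unicharset_extractor", base + ".box"],
        ["mftraining"] + common,
        ["shapeclustering"] + common,
        ["mftraining"] + common,   # second pass, for the shape table
        ["cntraining", tr_file],   # assumed arguments
    ]


def combine_command(lang="mat"):
    """Return the final packaging command; run it after renaming the files."""
    return ["combine_tessdata", lang + "."]


def run_all(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all(training_commands())       # dry run: prints the commands only
    run_all([combine_command()])
```

Keeping dry_run=True as the default means nothing is executed until you have checked the printed commands against your own file names.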

The final generated file is mat.traineddata. Move this file into the tessdata folder of your Tesseract installation so that the Tesseract library can access it, e.g. C:\Program Files\Tesseract-OCR\tessdata

To test, go into one of your test data folders, such as “1”, and run tesseract with your language file:

tesseract 1.jpg output -l mat -psm 10

In the output file you should see the character “1”. Congratulations, you have just trained your first OCR language!

Source code:

Classifying characters:

http://pastebin.com/iJQsPh9L

Generating Tesseract training data:

11 Comments

  • eileen li
    June 12, 2013

    Hi i need your help,
    i followed your tutorial and generated the box file and training image. however when i run tesseract to train it i get this error in the command prompt:

    tesseract open source ocr engine v3.02 with leptonica
    empty page!!
    empty page!!
    cannot create output file eng.digital-7.exp0.txt

    do you have any idea what is this error and how to solve it?

  • jothis reghunadh
    September 5, 2013

    hi could you please tell me from where you learned tesseract programming..

  • Anthony Tresontani
    September 16, 2013

    Hi Michael,

    Thanks for the great post.

    We also think the documentation of tesseract is really confusing. Therefore, we created a paid service to help people training their ocr, it's called: http://www.tesseract-training.com/

  • madhur r
    October 29, 2013

    Why does Tesseract misread letters, changing o to 0 and 0 to O? Is it possible to correct or improve this? Can anyone help?

    • ayoungprogrammer
      October 29, 2013

      As you can probably see, o 0 are very similar characters. Training it more may help in recognition. Tesseract OCR still has a long way to go before accomplishing near perfection.

  • Epost Address
    February 8, 2014

    Hi Michael,

    Can you please post the contents of your "mat.arial.exp0" file? This is a great project.

  • gunshi gupta
    October 4, 2015

    Hi michael,
    great post.
    I just had a couple of doubts though, has the part where you've rotated the points about a calculated angle been done to deskew the image?
    and wouldn't approximating the contours as polygons decrease recognition accuracy?
    Thanks!

    • ayoungprogrammer
      October 10, 2015

      The images I took were straight so I did not have to deskew. Recognition accuracy was not a problem for me since I use high resolution images. However, I do not think this solution would be as successful with lower quality images.

  • Unknown
    November 13, 2015

    Hey Michael,

    Clearly this project resonated with a bunch of people – cool!

    Me and my company (https://socratic.org) are working on an education app that hopes to incorporate the kind of stuff you did on this project – detecting math in photos and returning useful results like Wolfram Alpha.

    We're new to OCR, and if you're up for it, it would help us a lot if we could do a quick hangout and ask you some questions. I'd be happy to compensate you for your time.

    Thanks,
    Shreyans

    PS: We've got two Torontonians (including me), and one recent Waterloo grad on our team! He was in Velocity.
    PPS: You've got an open invitation to come hang out and work from our office if you're ever in New York.
