Tesseract – ayoungprogrammer's blog

Equation OCR Tutorial Part 3: Making an OCR for Equations using OpenCV and Tesseract

Computer Vision, Uncategorized | January 14, 2013

I’ll be doing a series on using OpenCV and Tesseract to take a scanned image of an equation and be able to read it in and graph it and give related data. I was surprised at how well the results turned out =)

I will be using versions OpenCV 2.4.2 and Tesseract OCR 3.02.02.

I have also made two tutorials on installing Tesseract and OpenCV for Microsoft Visual Studio 2008 Express on Vista x86. For other systems, see the official sites for documentation on installing the libraries.

Parts

Equation OCR Part 1: Using contours to extract characters in OpenCV
Equation OCR Part 2: Training characters with Tesseract OCR
Equation OCR Part 3: Equation OCR

Tutorials

Installing OpenCV: http://blog.ayoungprogrammer.com/2012/10/tutorial-install-opencv-242-for-windows.html/

Installing Tesseract: http://blog.ayoungprogrammer.com/2012/11/tutorial-installing-tesseract-ocr-30202.html/

Official Links:

OpenCV : http://opencv.org/
Tesseract OCR: http://code.google.com/p/tesseract-ocr/

Overview:

The overall goal of the final program is to convert the image of an equation into a text equation that we can graph. We can break down this project into three parts: extracting characters from text, training the OCR, and recognition, which converts images of equations into text.

Recognition

Recognition is easy once we have the training files we need for Tesseract. To initialize Tesseract with our trained language (“mat”) and set the page segmentation mode to 10, which treats the image as a single character:

tess_api.Init("", "mat", tesseract::OEM_DEFAULT);

tess_api.SetPageSegMode(static_cast<tesseract::PageSegMode>(10));

After extracting all the characters we can use Tesseract on those single characters to get the recognized character.

OpenCV uses a different data storage type from Tesseract, but we can easily pass the raw data from a Mat to Tesseract.

tess_api.TesseractRect(resizedPic.data, 1, resizedPic.step1(), 0, 0, resizedPic.cols, resizedPic.rows);

tess_api.SetImage(resizedPic.data, resizedPic.size().width, resizedPic.size().height, resizedPic.channels(), resizedPic.step1());

tess_api.Recognize(0);

const char* out=tess_api.GetUTF8Text();

The output holds the recognized character. GetUTF8Text returns a newly allocated buffer, so free it with delete[] once the character has been copied. Since the characters have already been sorted from left to right, we can just append each recognized character to a string stream and output the final result.

Exponents

A polynomial contains variables (x), numbers, brackets and exponents. Exponents are easy to find: check whether the bottom of a character reaches 2/3 of the way down the line of text. If it doesn’t, then it is probably superscript and we can put a ^ in front of the number to signify an exponent. The green line shows the 2/3 line to check. As you can see, all the standard characters that are not exponents go past the 2/3 line.
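Here is a minimal sketch of the 2/3 check, written without OpenCV so it stands on its own. The Glyph struct and the function names are made up for illustration and are not from the original source. It also assumes that only digits are ever treated as exponents, since a minus sign sits above the 2/3 line as well.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical record for one recognized character: the symbol returned by
// the OCR step and the bottom edge of its bounding box, in pixels measured
// from the top of the cropped equation image.
struct Glyph {
    char symbol;
    int bottom;
};

// A digit counts as an exponent when its bottom edge stays above the line
// drawn 2/3 of the way down the image (assumption: only digits qualify).
bool isExponent(const Glyph& g, int imageHeight) {
    bool digit = std::isdigit(static_cast<unsigned char>(g.symbol)) != 0;
    return digit && 3 * g.bottom < 2 * imageHeight;
}

// Glyphs are assumed to be sorted from left to right already.
std::string buildEquation(const std::vector<Glyph>& glyphs, int imageHeight) {
    std::string eqn;
    for (const Glyph& g : glyphs) {
        if (isExponent(g, imageHeight)) eqn += '^';
        eqn += g.symbol;
    }
    return eqn;
}
```

Note that this puts a ^ in front of every raised digit, so a two-digit exponent would need extra handling.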

Wolfram

To send the equation to Wolfram Alpha I had to reverse-engineer the URL format they use, which was quite simple. All URLs begin with “http://www.wolframalpha.com/input/?i=”. Numbers and letters map to themselves, but other characters map to percent-encoded hex codes:

if(eqn[i]=='+')url<<"%2B";

if(eqn[i]=='^')url<<"%5E";

if(eqn[i]=='=')url<<"%3D";

if(eqn[i]=='(')url<<"%28";

if(eqn[i]==')')url<<"%29";
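Put together, the mapping could look like the sketch below. The function name wolframUrl is my own for this example, and the default branch simply passes every other character through unchanged, which covers the numbers and letters mentioned above.

```cpp
#include <string>

// Builds a Wolfram Alpha query URL from a recognized equation string.
// Only the five symbols from the tutorial are escaped; everything else
// (digits, letters) is copied as-is.
std::string wolframUrl(const std::string& eqn) {
    std::string url = "http://www.wolframalpha.com/input/?i=";
    for (char c : eqn) {
        switch (c) {
            case '+': url += "%2B"; break;
            case '^': url += "%5E"; break;
            case '=': url += "%3D"; break;
            case '(': url += "%28"; break;
            case ')': url += "%29"; break;
            default:  url += c;     break;
        }
    }
    return url;
}
```

For example, passing "x^2+1" gives a URL ending in "?i=x%5E2%2B1".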

Extensions

The program can be extended to work for other functions such as log, sin, cos, etc. by doing some additional training for letters. It can also be extended to work for fraction bars, although that takes some more work. You first look for any “bars”, which are shapes with width 3 times greater than their height, and check whether there are shapes both above and below the bar. Take the longest bar first, because you want to find the largest fraction first. Then you can recursively find fractions in the numerator and denominator, going from the largest fraction to the smallest, and append (numerator) / (denominator) to the string. However, there may be other terms that are not fractions to the left and right of the fraction, so you will need to re-sort by x-coordinates.
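The bar search described above could be sketched roughly like this. It is not part of the posted source: Shape, isBar, overlapsBar and findFractionBar are names I made up, and a shape is taken to belong to a bar when its horizontal centre falls inside the bar, which is only one possible test.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical bounding box: top-left corner plus size, y grows downward.
struct Shape { int x, y, w, h; };

// A "bar" is any shape more than 3 times as wide as it is tall.
bool isBar(const Shape& s) { return s.w > 3 * s.h; }

// True when the horizontal centre of s lies within the bar's x-range.
bool overlapsBar(const Shape& bar, const Shape& s) {
    int cx = s.x + s.w / 2;
    return cx >= bar.x && cx <= bar.x + bar.w;
}

// Returns the index of the widest bar that has at least one shape above it
// and one below it, or -1 when there is no such bar.
int findFractionBar(const std::vector<Shape>& shapes) {
    int best = -1;
    for (std::size_t i = 0; i < shapes.size(); ++i) {
        if (!isBar(shapes[i])) continue;
        bool above = false, below = false;
        for (std::size_t j = 0; j < shapes.size(); ++j) {
            if (j == i || !overlapsBar(shapes[i], shapes[j])) continue;
            if (shapes[j].y + shapes[j].h <= shapes[i].y) above = true;
            if (shapes[j].y >= shapes[i].y + shapes[i].h) below = true;
        }
        if (above && below && (best < 0 || shapes[i].w > shapes[best].w))
            best = static_cast<int>(i);
    }
    return best;
}
```

A plain minus sign is also a bar, but it has nothing above or below it, so it is skipped. The recursion into numerator and denominator would call the same function on the shapes above and below the bar that was found.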

Conclusion

In finishing this tutorial I hope you have learned how to use OCR and contour extraction, as I certainly have. If you release any extensions of these programs based on my tutorials, I hope you will credit me and also send me a message. Thanks for reading!

Source code

http://pastebin.com/fvq1JGsW

Equation OCR Tutorial Part 2: Training characters with Tesseract OCR

Computer Vision, Uncategorized | January 13, 2013

Training

We will split the training process into two parts: classifying and Tesseracting. In classifying, we will use the extraction method in part one to create a program to generate training data for Tesseract. We will extract characters and have a user identify the character to be classified. The characters will go in folders labelled with the character name. For example all the 9’s will go in the “9” folder and all the x’s will go in the “x” folder. For the Tesseracting part, we will take our training data and run through the Tesseract training process so that the data can be used for OCR.

Classifying

Classifying will take the longest because the training data needs about 10 samples of each character. Our characters are the digits 0 to 9, left bracket, right bracket, plus sign and x. We will ignore dashes because they can be easily recognized as shapes with width three times greater than their height. We can also ignore equal signs because they are just two dashes on top of one another. With some slight modifications to the extraction program in part 1, we can make a training program for this. The training data I took required about 30 different images of equations.

Classifying source:

http://pastebin.com/iJQsPh9L

Tesseracting

The original Tesseract training method is confusingly documented and very tedious. The recommended method is to supply sample images and, in a separate data file, list the symbol and the rectangle that corresponds to each character in the image. As you can imagine, this becomes very tedious because you need to find the coordinates and dimensions of the rectangle for every character. There are a few online GUI tools you can use to help with the process. I, on the other hand, am very lazy and did not want to go through a hundred rectangles, so I made a program that generates an image with the training data and also generates the corresponding rectangles. The final result is something like this:

Source code here:

http://pastebin.com/9NNk0uMB
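For reference, each line of the rectangle (box) file has the form "symbol left bottom right top page", with coordinates measured from the bottom-left corner of the image. OpenCV rectangles are measured from the top-left, so the y values have to be flipped. A small sketch of that conversion follows; boxLine is a name invented for this example, not a function from the linked source.

```cpp
#include <sstream>
#include <string>

// Formats one box file line from a top-left based rectangle (x, y, w, h)
// inside an image of the given height. The box file origin is bottom-left,
// so the vertical coordinates are flipped against imageHeight.
std::string boxLine(char symbol, int x, int y, int w, int h,
                    int imageHeight, int page = 0) {
    std::ostringstream out;
    out << symbol << ' '
        << x << ' '                        // left
        << (imageHeight - (y + h)) << ' '  // bottom
        << (x + w) << ' '                  // right
        << (imageHeight - y) << ' '        // top
        << page;
    return out.str();
}
```

A 30x40 character "9" placed at (10, 20) in an image 100 pixels high becomes the line "9 10 40 40 80 0".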

Now that we have finished creating the training boxes, we can feed the results into the Tesseract engine so it can learn to recognize the characters. Open up a command prompt and go to the folder containing your .tif file and the file with the rectangle data. Type in tesseract and hit enter. If it says command not found, Tesseract is not installed properly or is not on your PATH.

To start the training: (mat for math)
tesseract mat.arial.exp0.tif mat.arial.exp0 nobatch box.train

Now you will see that Tesseract has generated a file called mat.arial.exp0.tr. Don’t touch the file. Next we have to tell Tesseract which characters we are using. The unicharset file can be generated by running:
unicharset_extractor mat.arial.exp0.box

Create a new file called font_properties with no file extension, just like the unicharset file (I copied the unicharset file and saved it under the new name font_properties). Do not use Notepad as it will mess up the formatting. Use something like WordPad. The only content of font_properties should be:

arial 1 0 0 1 0

Next to start mftraining:

mftraining -F font_properties -U unicharset mat.arial.exp0.tr

Shape clustering:
shapeclustering -F font_properties -U unicharset mat.arial.exp0.tr

mftraining again for shapetables:
mftraining -F font_properties -U unicharset mat.arial.exp0.tr

cntraining for clustering:
cntraining mat.arial.exp0.tr

Now we have to combine all these files into one file. First rename the following files:

inttemp -> mat.inttemp
shapetable -> mat.shapetable
normproto -> mat.normproto
pffmtable -> mat.pffmtable
unicharset -> mat.unicharset

To generate your new tess data file:
combine_tessdata mat.

The final generated file is mat.traineddata. Move this file into the tessdata folder in the Tesseract installation folder so that the Tesseract library can access it: C:\Program Files\Tesseract-OCR\tessdata

To test go into one of your test data folders like “1” and run tesseract with your language file:

tesseract 1.jpg output -l mat -psm 10

In the output file you should see the character “1”. Congratulations, you have just trained your first OCR language!

Source codes:

Classifying characters:

http://pastebin.com/iJQsPh9L

Generating Tesseract training data:

http://pastebin.com/9NNk0uMB

Equation OCR Tutorial Part 1: Using contours to extract characters in OpenCV

Computer Vision, Uncategorized | January 10, 2013

Overview:

The overall goal of the final program is to convert the scanned text into a recognizable format that can be processed later. We can break down this project into three parts: extracting characters from text, training the OCR, and recognition, which converts images of equations into graphs of the equations.

Extraction:

Now we can break down extraction to even more steps: preprocessing and contour analysis.

Preprocessing

The first step of preprocessing is to smooth out the image and make it a binary image (black or white) for contour analysis. This is our original image:

cv::Mat img = cv::imread("equation1.jpg", 0);

We first apply a Gaussian blur to smooth out the image. We then use adaptive thresholding to binarize the image (make it black or white) and we then invert the colours since OpenCV uses black as the background and white as the objects.

cv::Size size(3,3);

cv::GaussianBlur(img,img,size,0);

adaptiveThreshold(img, img,255,CV_ADAPTIVE_THRESH_MEAN_C, CV_THRESH_BINARY,75,10);

cv::bitwise_not(img, img);

Next we have to fix the angle of the text. In this case the offset angle isn’t bad, maybe plus or minus 1 or 2 degrees, but in other cases where the angle is greater we will need to correct it. We can do this by finding the minimum bounding box around the line of text. This method works best for straight, linear text. I later discovered that it fails when you have the variable y, large brackets, or a very short expression: the smallest-area rectangle then has a large rotation. It suits our purpose for long equations, though. There is probably a better way to do this with Hu moments, but this will suffice. I took this method from another blog:
http://felix.abecassis.me/2011/10/opencv-rotation-deskewing/

Now that we have this rotated box around our aligned text, we can just make that box our new bounding box.

Contour Extraction

Next we can use OpenCV’s contour function to detect and find all the “blobs” or shapes. I also check if any of the shapes are greater than a certain area because if the shape is very small then it is probably junk. Another issue is that some characters like the equal sign “=” contain two shapes but this can easily be fixed. We can check if two shapes are on top of another if the x coordinates of their centres are within a certain threshold. Then, we can combine the shapes and make a new contour out of it.

cv::findContours(cropped, contours, hierarchy, CV_RETR_EXTERNAL, CV_CHAIN_APPROX_TC89_KCOS, Point(0, 0));
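The area filter and the merging of stacked shapes could be sketched as below, using plain bounding boxes instead of OpenCV contours so the example stands on its own. Blob, stacked, unite and mergeStacked are names made up here, and both thresholds are values you would have to tune for your own images.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Hypothetical bounding box of one contour: top-left corner plus size.
struct Blob { int x, y, w, h; };

// Two blobs are "stacked" when the x coordinates of their centres are
// within the given threshold of each other.
bool stacked(const Blob& a, const Blob& b, int threshold) {
    int ca = a.x + a.w / 2;
    int cb = b.x + b.w / 2;
    return std::abs(ca - cb) <= threshold;
}

// Smallest box that contains both a and b.
Blob unite(const Blob& a, const Blob& b) {
    int left = std::min(a.x, b.x);
    int top = std::min(a.y, b.y);
    int right = std::max(a.x + a.w, b.x + b.w);
    int bottom = std::max(a.y + a.h, b.y + b.h);
    return Blob{left, top, right - left, bottom - top};
}

// Drops blobs whose area is below minArea (probably junk), then merges
// each remaining blob into an earlier one it is stacked on.
std::vector<Blob> mergeStacked(const std::vector<Blob>& blobs,
                               int minArea, int threshold) {
    std::vector<Blob> out;
    for (const Blob& b : blobs) {
        if (b.w * b.h < minArea) continue;
        bool merged = false;
        for (Blob& o : out) {
            if (stacked(o, b, threshold)) {
                o = unite(o, b);
                merged = true;
                break;
            }
        }
        if (!merged) out.push_back(b);
    }
    return out;
}
```

With this, the two dashes of an equal sign end up as a single box, while a tiny speck of noise is dropped before it can be merged with anything.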

Now that we have found all our contours all we need to do is extract each contour and save them. We can take the bounding rectangle of each contour and cut that part out of the original image. However, there are some cases where the bounding rectangle will take part of another shape. To prevent this, we can use a “mask” or basically a filter to copy from the bounding rectangle only pixels within the contour.

Mat mask = Mat::zeros(image.size(), CV_8UC1);
drawContours(mask, contours_poly, i, Scalar(255), CV_FILLED);

Mat extractPic;
image.copyTo(extractPic,mask);
Mat resizedPic = extractPic(r);

Here are some sample equations you can use:

Source Code

Source code is available here: http://pastebin.com/Q2x8kHmG

Tutorial: How to Install Tesseract OCR 3.02.02 for Visual Studio 2008 on Windows Vista

Computer Vision, Uncategorized | November 4, 2012

I could not find a single good tutorial for setting up Tesseract on VS2008 other than the docs that come with Tesseract so I decided to make my own tutorial for those interested.

More updated tutorial: https://github.com/gulakov/tesseract-ocr-sample

1. Download and install the full windows version of Tesseract. This way you won’t have to extract all the different separate files.

http://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-setup-3.02.02.exe
Leave the destination folder as the default (C:\Program Files\Tesseract-OCR)
Remember to check Tesseract Development files!

2. Open up Microsoft Visual Studio 2008 and go to Tools -> Options
Projects and Solutions -> VC++ Directories -> Show directories for: Include files

Add:
C:\Program Files\Tesseract-OCR\include
C:\Program Files\Tesseract-OCR\include\tesseract
C:\Program Files\Tesseract-OCR\include\leptonica

3. Next click show directories for -> Library Files

Add:
C:\Program Files\Tesseract-OCR\lib

4. Configure linker options for Tesseract

Right click your project in solution explorer and click properties

Configuration Properties -> Linker->Input ->Additional Dependencies

Add this in there:

libtesseract302.lib
libtesseract302d.lib
liblept168.lib
liblept168d.lib

**You will have to do this for every project
***I think you can do this with the property sheets but I don’t know how to set it up. Message me if you do!

5. Copy liblept168.dll, liblept168d.dll, libtesseract302.dll and libtesseract302d.dll from C:\Program Files\Tesseract-OCR into your project folder (Optional)

If you get a missing .dll error when you run your program, add these files to your project folder.

6. Hello World!

To check if your project works create your main cpp file with this code:

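#include <baseapi.h>
#include <allheaders.h>
#include <iostream>
#include <string>

using namespace std;

int main(void){

    tesseract::TessBaseAPI api;
    // Init returns 0 on success
    if (api.Init("", "eng", tesseract::OEM_DEFAULT) != 0) {
        cerr << "Could not initialize Tesseract" << endl;
        return 1;
    }
    // 7 = treat the image as a single line of text
    api.SetPageSegMode(static_cast<tesseract::PageSegMode>(7));
    api.SetOutputName("out");

    cout << "File name:";
    string image;
    cin >> image;

    // Load the file once with Leptonica to check that it can be read
    PIX *pixs = pixRead(image.c_str());
    if (pixs == NULL) {
        cerr << "Could not read " << image << endl;
        return 1;
    }
    pixDestroy(&pixs);

    STRING text_out;
    api.ProcessPages(image.c_str(), NULL, 0, &text_out);

    cout << text_out.string();

    return 0;
}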

Copy this image into your project folder: (Right click save file as)

Copy eng.traineddata from C:\Program Files\Tesseract-OCR\tessdata into your project folder and the program should output Hello World! The traineddata file is the data file Tesseract uses for reading the text.

More to come! I will be making a tutorial maybe next week on linking OpenCV with Tesseract and maybe also on how to train Tesseract.