Equation OCR Tutorial Part 1: Using contours to extract characters in OpenCV

I’ll be doing a series on using OpenCV and Tesseract to read a scanned image of an equation, graph it and report related data. I was surprised at how well the results turned out =)

I will be using OpenCV 2.4.2 and Tesseract OCR 3.02.02.

I have also made two tutorials on installing Tesseract and OpenCV on Vista x86 with Microsoft Visual Studio 2008 Express. However, you can also follow the documentation on the official sites to install the libraries on your system.

Parts

Equation OCR Part 1: Using contours to extract characters in OpenCV
Equation OCR Part 2: Training characters with Tesseract OCR
Equation OCR Part 3: Equation OCR

Tutorials

Installing OpenCV: http://blog.ayoungprogrammer.com/2012/10/tutorial-install-opencv-242-for-windows.html/

OpenCV : http://opencv.org/

Overview:

The overall goal of the final program is to convert the scanned text into a format that can be processed later. We can break this project down into three parts: extracting characters from the text, training the OCR, and recognition, which converts images of equations into graphs of those equations.

Extraction:

We can break extraction down into two further steps: preprocessing and contour analysis.

Preprocessing

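The first step of preprocessing is to smooth out the image and make it a binary image (black or white) for contour analysis. This is our original image:
We first apply a Gaussian blur to smooth out the image. We then use adaptive thresholding to binarize the image (make it black or white), and finally invert the colours, since OpenCV treats black as the background and white as the objects.
cv::Size size(3,3);
cv::GaussianBlur(img, img, size, 0);  // smooth out noise; img must be 8-bit grayscale for the next step
// Block size (75) and offset (10) are illustrative values; tune them for your own scans.
cv::adaptiveThreshold(img, img, 255, CV_ADAPTIVE_THRESH_MEAN_C, CV_THRESH_BINARY, 75, 10);
cv::bitwise_not(img, img);  // invert: white objects on a black background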
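
To make the thresholding step more concrete, here is a minimal sketch of what mean-based adaptive thresholding followed by inversion does. It is written in plain C++ on a small grayscale grid so that it compiles without OpenCV. The names `Gray`, `adaptiveMeanInverted`, `radius` and `offset` are hypothetical, and clamping at the borders is an assumption; in the real pipeline the OpenCV calls do this work.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical grayscale image type: rows of 0-255 intensities.
typedef std::vector<std::vector<int> > Gray;

// Mean-based adaptive threshold followed by inversion.
// A pixel becomes 0 (background) when it is brighter than
// (local mean - offset), and 255 (object) otherwise.
Gray adaptiveMeanInverted(const Gray& src, int radius, int offset) {
    const int rows = static_cast<int>(src.size());
    const int cols = rows > 0 ? static_cast<int>(src[0].size()) : 0;
    Gray dst(rows, std::vector<int>(cols, 0));
    for (int y = 0; y < rows; ++y) {
        for (int x = 0; x < cols; ++x) {
            long sum = 0;
            int count = 0;
            for (int dy = -radius; dy <= radius; ++dy) {
                for (int dx = -radius; dx <= radius; ++dx) {
                    // Clamp coordinates so border pixels reuse their nearest neighbour.
                    int yy = std::min(std::max(y + dy, 0), rows - 1);
                    int xx = std::min(std::max(x + dx, 0), cols - 1);
                    sum += src[yy][xx];
                    ++count;
                }
            }
            double mean = static_cast<double>(sum) / count;
            bool background = src[y][x] > mean - offset;
            dst[y][x] = background ? 0 : 255;
        }
    }
    return dst;
}
```

Because each pixel is compared with its own neighbourhood rather than one global value, dark ink on an unevenly lit scan still ends up as white objects on a black background.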

Next we have to fix the angle of the text. In this case the offset angle isn’t bad, maybe ±1 or 2 degrees, but in other cases where the angle is greater it needs to be corrected. We can do this by finding the minimum-area bounding box around the line of text. This method works well for long, straight lines of text. However, I later discovered that it fails when the equation contains the variable y or large brackets, or when the expression is very short: the smallest-area rectangle then has a large rotation. It suits our purpose for long equations, though. There is probably a better way to do this with Hu moments, but this will suffice. I took this method from another blog:
http://felix.abecassis.me/2011/10/opencv-rotation-deskewing/
Now that we have this rotated box around our aligned text, we can simply use it as our new bounding box.
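
One detail of the minimum-area box approach is worth sketching: the reported angle has to be normalised before it can be used as a skew correction. The snippet below is a rough illustration in plain C++, assuming the angle arrives in the range [-90, 0) as OpenCV 2.4's cv::minAreaRect reports it. `RotatedBox` and `normalizeSkew` are hypothetical names standing in for the relevant fields of cv::RotatedRect.

```cpp
#include <utility>

// Hypothetical stand-in for the fields of cv::RotatedRect used here.
struct RotatedBox {
    double width;
    double height;
    double angle;  // degrees, assumed to lie in [-90, 0)
};

// Return an equivalent box whose angle is the small skew to correct,
// in the range [-45, 45).
RotatedBox normalizeSkew(RotatedBox box) {
    if (box.angle < -45.0) {
        // The box was measured against its other side: rotate the
        // reference by 90 degrees and swap the side lengths to match.
        box.angle += 90.0;
        std::swap(box.width, box.height);
    }
    return box;
}
```

The normalised angle would then typically be passed to cv::getRotationMatrix2D and cv::warpAffine to rotate the image upright.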

Contour Extraction

Next we can use OpenCV’s findContours function to detect all the “blobs” or shapes. I also check that each shape is larger than a certain area, because a very small shape is probably junk. Another issue is that some characters, like the equals sign “=”, consist of two shapes, but this is easily fixed. Two shapes are stacked one on top of the other if the x coordinates of their centres are within a certain threshold of each other. In that case we combine the shapes and make a new contour out of them.
cv::findContours(cropped, contours, hierarchy, CV_RETR_EXTERNAL, CV_CHAIN_APPROX_TC89_KCOS, cv::Point(0, 0));
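
The area filter and the “=” fix can be sketched roughly as follows. This is not the tutorial's actual code: it merges bounding boxes instead of building a new contour, and it uses a plain `Box` struct in place of cv::Rect so that it compiles without OpenCV. `Box`, `centreX`, `unite`, `filterAndMerge`, `minArea` and `maxDx` are all hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical axis-aligned box, standing in for cv::Rect.
struct Box {
    int x, y, w, h;
};

double centreX(const Box& b) { return b.x + b.w / 2.0; }

// Smallest box covering both a and b.
Box unite(const Box& a, const Box& b) {
    int left = std::min(a.x, b.x);
    int top = std::min(a.y, b.y);
    int right = std::max(a.x + a.w, b.x + b.w);
    int bottom = std::max(a.y + a.h, b.y + b.h);
    Box u = {left, top, right - left, bottom - top};
    return u;
}

// Drop boxes smaller than minArea, then merge boxes whose centres are
// within maxDx of each other horizontally (e.g. the two bars of "=").
std::vector<Box> filterAndMerge(const std::vector<Box>& boxes, int minArea, double maxDx) {
    std::vector<Box> out;
    for (std::size_t i = 0; i < boxes.size(); ++i) {
        if (boxes[i].w * boxes[i].h < minArea) continue;  // probably junk
        bool merged = false;
        for (std::size_t j = 0; j < out.size(); ++j) {
            if (std::fabs(centreX(out[j]) - centreX(boxes[i])) <= maxDx) {
                out[j] = unite(out[j], boxes[i]);
                merged = true;
                break;
            }
        }
        if (!merged) out.push_back(boxes[i]);
    }
    return out;
}
```

This is a single pass, so a merged box is not re-checked against the boxes already kept; that is enough for simple cases such as “=” but may need refining for denser layouts.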

Now that we have found all our contours, all we need to do is extract each one and save it. We can take the bounding rectangle of each contour and cut that part out of the original image. However, in some cases the bounding rectangle will include part of another shape. To prevent this, we can use a “mask”, which is basically a filter, to copy only the pixels inside the contour from the bounding rectangle.

// mask: black image with the contour filled in white; img: the source image (both names assumed here)
cv::Mat extractPic;
img.copyTo(extractPic, mask);        // copy only the pixels inside the contour
cv::Mat resizedPic = extractPic(r);  // r: bounding rectangle of the contour
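
To show what the mask buys us, here is a small sketch of masked extraction in plain C++, without OpenCV. `Gray`, `Rect` and `extractMasked` are hypothetical names, and the sketch assumes that the mask has the same size as the image and that the rectangle lies fully inside it.

```cpp
#include <vector>

// Hypothetical grayscale image type: rows of 0-255 intensities.
typedef std::vector<std::vector<int> > Gray;

// Hypothetical rectangle, standing in for cv::Rect.
struct Rect {
    int x, y, w, h;
};

// Copy the part of `image` inside `r`, keeping only pixels where `mask`
// is non-zero; everything else in the result stays 0 (background).
Gray extractMasked(const Gray& image, const Gray& mask, const Rect& r) {
    Gray out(r.h, std::vector<int>(r.w, 0));
    for (int y = 0; y < r.h; ++y) {
        for (int x = 0; x < r.w; ++x) {
            if (mask[r.y + y][r.x + x] != 0) {
                out[y][x] = image[r.y + y][r.x + x];
            }
        }
    }
    return out;
}
```

Pixels from a neighbouring character that fall inside the bounding rectangle are left black, because the mask is zero there.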

Here are some sample equations you can use:

Source Code

Source code is available here: http://pastebin.com/Q2x8kHmG

• beej
February 1, 2013

why not use connected components instead of contours? maybe this is more computationally expensive?

• ayoungprogrammer
February 3, 2013

I could, but OpenCV already has a built-in contour function, so I just used that.

• Henry Jordan
February 9, 2013

Very good brief and this post helped me a lot. Say thank you for this knowledgeable post.

• Yonas Teodros
February 18, 2013

tnx

• Mzk
April 30, 2013

Nice one 😉

• Unknown
October 31, 2013

Hello Michael, Do we need to do contour analysis? I thought we could directly train tesseract without step 1 in your tutorial. I'm a newbie to this field so would love to know more.

• ayoungprogrammer
October 31, 2013

I used contour analysis because the scope of the project was just for scanning math equations. By using contours, I was able to extract every single character and OCR them individually with more precision. However, Tesseract is also equipped for whole paragraphs and uses nearby characters to improve OCR results as well.

Step 1 was used mostly to remove background noise and set up an environment for recognition.

Depending on the requirements of your program, you may or may not need step 1 in the tutorial. Good luck!

• JAVA – core java
July 10, 2014

I am getting opencv_highgui230.dll..did i missed something here?

• ayoungprogrammer
July 10, 2014

Find opencv_highgui230.dll in the install folder and move it to your project folder. Do this for all missing DLLs.

• khan
November 10, 2016

hello everyone ,
i have a question.is it possible that we can have extracted contours in the sequence as they are in equation like first x then 2 then brackets then 3 etc..
thankyou

• ayoungprogrammer
November 10, 2016

You can get the sequence by sorting the contours from left to right

• Sharath
December 9, 2016

is it possible to extract characters out of equation using opencv in android?
i’m developing an app and i need your help asap.
Thank you n have a nice day.

• ND Minh
February 21, 2017