Equation OCR Tutorial Part 1: Using contours to extract characters in OpenCV

January 10, 2013

I’ll be doing a series on using OpenCV and Tesseract to take a scanned image of an equation, read it in, graph it and give related data. I was surprised at how well the results turned out =)

I will be using OpenCV 2.4.2 and Tesseract OCR 3.02.02.

I have also made two tutorials on installing Tesseract and OpenCV for Vista x86 with Microsoft Visual Studio 2008 Express. You can also go to the official sites for documentation on installing the libraries on your system.

Parts

Equation OCR Part 1: Using contours to extract characters in OpenCV
Equation OCR Part 2: Training characters with Tesseract OCR
Equation OCR Part 3: Equation OCR

Tutorials

Installing OpenCV: http://blog.ayoungprogrammer.com/2012/10/tutorial-install-opencv-242-for-windows.html/

Installing Tesseract: http://blog.ayoungprogrammer.com/2012/11/tutorial-installing-tesseract-ocr-30202.html/

Official Links:

OpenCV : http://opencv.org/
Tesseract OCR: http://code.google.com/p/tesseract-ocr/

Overview:

The overall goal of the final program is to convert scanned text into a format that can be processed later. We can break the project down into three parts: extracting characters from the text, training the OCR, and recognition, which converts images of equations into graphs of the equations.

Extraction:

Extraction itself breaks down into two steps: preprocessing and contour analysis.

Preprocessing

The first step of preprocessing is to smooth the image and convert it to a binary image (every pixel either black or white) for contour analysis. This is our original image:

cv::Mat img = cv::imread("equation1.jpg", 0);  // 0 = load as grayscale

We first apply a Gaussian blur to smooth out the image. We then use adaptive thresholding to binarize it (make every pixel black or white), and finally invert the colours, since OpenCV's contour functions treat white (non-zero) pixels as objects and black as the background.

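// 3x3 Gaussian kernel; sigma 0 lets OpenCV derive it from the kernel size
cv::Size size(3, 3);
cv::GaussianBlur(img, img, size, 0);
// Threshold each pixel against the mean of its 75x75 neighbourhood minus 10
cv::adaptiveThreshold(img, img, 255, CV_ADAPTIVE_THRESH_MEAN_C, CV_THRESH_BINARY, 75, 10);
// Invert so the text is white on a black background
cv::bitwise_not(img, img);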
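To make the thresholding step less of a black box, here is a minimal plain C++ sketch of roughly what mean adaptive thresholding followed by inversion does to each pixel. It is an illustration only, not OpenCV's implementation: the `Gray` type and `thresholdAndInvert` name are made up for this example, and the border handling (clamping coordinates to the image) is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using Gray = std::vector<std::vector<std::uint8_t>>;  // row-major grayscale image

// Mean adaptive threshold followed by inversion, written out by hand.
// blockSize must be odd; c is subtracted from the local mean.
// Coordinates outside the image are clamped to the nearest edge pixel
// (an assumption of this sketch).
Gray thresholdAndInvert(const Gray& src, int blockSize, double c) {
    assert(blockSize >= 3 && blockSize % 2 == 1);
    const int rows = static_cast<int>(src.size());
    const int cols = rows > 0 ? static_cast<int>(src[0].size()) : 0;
    const int half = blockSize / 2;
    Gray dst(rows, std::vector<std::uint8_t>(cols, 0));
    for (int y = 0; y < rows; ++y) {
        for (int x = 0; x < cols; ++x) {
            // Sum the blockSize x blockSize neighbourhood around (x, y)
            double sum = 0.0;
            for (int dy = -half; dy <= half; ++dy) {
                for (int dx = -half; dx <= half; ++dx) {
                    const int yy = std::min(std::max(y + dy, 0), rows - 1);
                    const int xx = std::min(std::max(x + dx, 0), cols - 1);
                    sum += src[yy][xx];
                }
            }
            const double localThreshold = sum / (blockSize * blockSize) - c;
            // A plain binary threshold would make bright pixels 255;
            // inverting turns dark ink into white objects instead.
            const bool isBright = src[y][x] > localThreshold;
            dst[y][x] = isBright ? 0 : 255;
        }
    }
    return dst;
}
```

Because the threshold is computed per neighbourhood, a dark stroke on unevenly lit paper is still picked out as long as it is darker than its surroundings by more than `c`.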

Next we have to fix the angle of the text. In this case the offset angle isn’t bad, maybe plus or minus 1 or 2 degrees, but in other cases where the angle is greater we will need to correct it. We can do this by finding the minimum-area bounding box around the line of text. This method works best for a long, straight line of text. I later discovered that it fails when the equation contains the variable y or large brackets, or when the expression is very short: the smallest-area rectangle then has a large rotation. It suits our purpose for long equations, though. There is probably a better way to do this with Hu moments, but this will suffice. I took this method from another blog:
http://felix.abecassis.me/2011/10/opencv-rotation-deskewing/

Now that we have this rotated box around our aligned text, we can simply make that box our new bounding box.
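For comparison, here is a minimal sketch of a different way to estimate the skew angle: the orientation of the foreground pixels computed from their second-order central moments. This is not the bounding-box method used in this post (and not Hu moments either), just an illustration. The `Pixel` struct and the `estimateSkewRadians` name are hypothetical, and whether this copes any better with short expressions is untested here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Pixel { double x; double y; };  // coordinates of one foreground pixel

// Estimates the dominant orientation of a set of points from their
// second-order central moments: theta = 0.5 * atan2(2*mu11, mu20 - mu02).
// The result is in radians, in (-pi/2, pi/2], measured from the x axis
// towards the y axis. In image coordinates y points down, so the visual
// direction of a positive angle is clockwise.
double estimateSkewRadians(const std::vector<Pixel>& pts) {
    assert(!pts.empty());
    const double n = static_cast<double>(pts.size());

    // Centroid of the points
    double meanX = 0.0, meanY = 0.0;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        meanX += pts[i].x;
        meanY += pts[i].y;
    }
    meanX /= n;
    meanY /= n;

    // Second-order central moments
    double mu20 = 0.0, mu02 = 0.0, mu11 = 0.0;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        const double dx = pts[i].x - meanX;
        const double dy = pts[i].y - meanY;
        mu20 += dx * dx;
        mu02 += dy * dy;
        mu11 += dx * dy;
    }
    return 0.5 * std::atan2(2.0 * mu11, mu20 - mu02);
}
```

The angle would still need converting to degrees before being handed to a rotation routine, and the sign convention of that routine should be checked before use.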

Contour Extraction

Next we can use OpenCV’s findContours function to detect all the “blobs” or shapes. I also check that each shape is larger than a certain area, because a very small shape is probably junk. Another issue is that some characters, like the equal sign “=”, consist of two shapes, but this is easily fixed. Two shapes are stacked on top of one another if the x coordinates of their centres are within a certain threshold of each other. We can then combine the shapes and make a new contour out of them.

cv::findContours(cropped, contours, hierarchy, CV_RETR_EXTERNAL, CV_CHAIN_APPROX_TC89_KCOS, cv::Point(0, 0));
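The area check and the merging of stacked shapes described above can be sketched in plain C++ (C++11). This is a simplified illustration under stated assumptions, not the tutorial's source: it works on bounding boxes rather than on the contours themselves, uses the box area as the size test, and the `Box` struct, `filterAndMerge` name and both thresholds are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Box { int x, y, width, height; };  // same idea as cv::Rect

double centreX(const Box& b) { return b.x + b.width / 2.0; }

// Smallest box that contains both a and b
Box unite(const Box& a, const Box& b) {
    const int left = std::min(a.x, b.x);
    const int top = std::min(a.y, b.y);
    const int right = std::max(a.x + a.width, b.x + b.width);
    const int bottom = std::max(a.y + a.height, b.y + b.height);
    return {left, top, right - left, bottom - top};
}

// Drops boxes whose area is below minArea, then merges any box whose
// centre x coordinate lies within maxCentreGap of an already kept box.
std::vector<Box> filterAndMerge(const std::vector<Box>& boxes,
                                int minArea, double maxCentreGap) {
    assert(minArea >= 0 && maxCentreGap >= 0.0);
    std::vector<Box> merged;
    for (const Box& b : boxes) {
        if (b.width * b.height < minArea) continue;  // probably junk
        bool absorbed = false;
        for (Box& m : merged) {
            if (std::abs(centreX(m) - centreX(b)) <= maxCentreGap) {
                m = unite(m, b);  // e.g. the two bars of "="
                absorbed = true;
                break;
            }
        }
        if (!absorbed) merged.push_back(b);
    }
    return merged;
}
```

With the two bars of an “=” sign sharing the same centre x coordinate, they end up as one box, while a stray speck below `minArea` is dropped. Both thresholds would need tuning for the scan resolution.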

Now that we have found all our contours, all we need to do is extract each one and save it. We can take the bounding rectangle of each contour and cut that part out of the original image. However, in some cases the bounding rectangle will include part of another shape. To prevent this, we can use a “mask”, basically a filter, to copy from the bounding rectangle only the pixels within the contour.

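// Start with an all-black mask the same size as the image
Mat mask = Mat::zeros(image.size(), CV_8UC1);
// Fill contour i in white so that only its pixels pass through the mask
drawContours(mask, contours_poly, i, Scalar(255), CV_FILLED);

// Copy only the masked pixels, then crop to the rectangle r
// (the contour's bounding rectangle)
Mat extractPic;
image.copyTo(extractPic, mask);
Mat resizedPic = extractPic(r);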

Here are some sample equations you can use:

Source Code

Source code is available here: http://pastebin.com/Q2x8kHmG

beej

February 1, 2013


why not use connected components instead of contours? maybe this is more computationally expensive?

ayoungprogrammer

February 3, 2013


I could, but OpenCV already has a built-in contour function, so I just used that.

Henry Jordan

February 9, 2013


Very good brief and this post helped me a lot. Say thank you for this knowledgeable post.
Yonas Teodros

February 18, 2013


tnx
Mzk

April 30, 2013


Nice one 😉
Unknown

October 31, 2013


Hello Michael, Do we need to do contour analysis? I thought we could directly train tesseract without step 1 in your tutorial. I'm a newbie to this field so would love to know more.

ayoungprogrammer

October 31, 2013


I used contour analysis because the scope of the project was just for scanning math equations. By using contours, I was able to extract every single character and OCR them individually with more precision. However, Tesseract is also equipped for whole paragraphs and uses nearby characters to improve OCR results as well.

Step 1 was used mostly to remove background noise and set up an environment for recognition.

Depending on the requirements of your program, you may or may not need step 1 in the tutorial. Good luck!

JAVA – core java

July 10, 2014


I am getting a message about opencv_highgui230.dll. Did I miss something here?

ayoungprogrammer

July 10, 2014


Find opencv_highgui230.dll in the install folder and move it to your project folder. Do this for all missing DLLs.

buyi wen

November 9, 2015


if you like tesseract ocr, you may like this free online ocr tool using tesseract ocr 3.02
khan

November 10, 2016


Hello everyone,
I have a question. Is it possible to get the extracted contours in the same sequence as they appear in the equation, like first x, then 2, then brackets, then 3, etc.?
Thank you

ayoungprogrammer

November 10, 2016


You can get the sequence by sorting the contours from left to right
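A minimal sketch of that idea, assuming each contour has already been reduced to a bounding box (for instance with cv::boundingRect). The `Box` struct and the function names are hypothetical and not taken from the tutorial's source:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Box { int x, y, width, height; };  // same idea as cv::Rect

// Orders two boxes by their left edge
bool isLeftOf(const Box& a, const Box& b) { return a.x < b.x; }

// Returns the boxes in reading order for a single line of text.
// stable_sort keeps the original order of boxes that share the same x.
std::vector<Box> sortLeftToRight(std::vector<Box> boxes) {
    std::stable_sort(boxes.begin(), boxes.end(), isLeftOf);
    assert(std::is_sorted(boxes.begin(), boxes.end(), isLeftOf));
    return boxes;
}
```

This only handles one line of text. Superscripts or stacked fractions would need the y coordinate taken into account as well.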

Sharath

December 9, 2016


Hello admin,
Is it possible to extract characters out of an equation using OpenCV on Android?
I’m developing an app and I need your help ASAP.
Thank you and have a nice day.
ND Minh

February 21, 2017


Hello ad,
Could you please upload source code again for Part 1?
Thank you so much!

ayoungprogrammer

February 23, 2017


http://pastebin.com/Q2x8kHmG

MY

March 14, 2017


Could you please let me know how to proceed if I have some white text on a black background and some black text on a white background?