Optical character recognition (OCR), sometimes called optical character reading, is the process of reading text from images and converting it into a machine-readable format such as a string. Images of old documents, receipts, license plates, and house numbers can all contain useful text, but reading that text manually is time-consuming and labor-intensive. This is where OCR comes into play. OCR is an important task in computer vision because it allows text from a wide range of sources to be digitized automatically.

In this tutorial, we’ll show you how to convert text from images into a machine-readable format with the help of the Python Pytesseract module. Pytesseract is a Python wrapper for Google’s Tesseract library for OCR. With the help of Pytesseract, we’ll be able to use Python to convert the words in an image to a string.

Installing the Google Tesseract OCR Engine

Before you can perform OCR in Python using the Pytesseract module, you need to first install the Tesseract OCR engine by Google. You can download the executable file for the Tesseract engine from GitHub.

Follow these instructions to install the OCR engine:

Once you open the executable file, you’ll first have to select a language.

Tesseract language option

Click the “Next” button on the following dialog box.

Tesseract installation step 1

You’ll be presented with a license agreement, as shown below. Click the “I Agree” button if you agree to the terms.

Tesseract installation step 2

You can install the Tesseract library for all users of your system or only for yourself. Choose the option you want from the following dialog box and click the “Next” button.

Tesseract installation step 3

Next, select the components that you want to include in your installation package. I suggest keeping the default components and clicking the “Next” button.

Tesseract installation step 4

The next dialog box will ask you to specify the installation location. Set the installation location and click the “Next” button.

Tesseract installation step 5

If you want, select a Start Menu folder in the following dialog box, then click the “Install” button.

Tesseract installation step 6

Installation will begin and you should see the following screen once the installation completes. Click the “Next” button.

Tesseract installation step 7

Finally, to close the installation setup, click the “Finish” button on the last dialog box.

Tesseract installation step 8

Installing Required Python Libraries

After Google’s Tesseract engine is installed, you need to install the Pytesseract and Pillow modules for Python. The Pytesseract module is a Python wrapper for the Tesseract engine you just installed. We’ll use the [Pillow library](/python/python-image-manipulation-with-pillow-library/) to import images in this tutorial. Execute the following pip commands in your terminal to install the two required libraries:


$ pip install pytesseract
$ pip install Pillow
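
To confirm that both packages ended up in the Python interpreter you plan to use, a quick check along these lines may help. This is only a sketch: the helper name `missing_packages` is our own and not part of either library. Note that Pillow is imported under the name `PIL`.

```python
from importlib.util import find_spec


def missing_packages(names=("pytesseract", "PIL")):
    """Return the import names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("pytesseract and Pillow are both importable.")
```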

Reading Text From Images

We’re almost ready to read text from images. Before that, though, you need to import the Pytesseract and Pillow libraries and specify the path to your Tesseract engine.

The following script imports the required libraries:

import pytesseract as pt
from PIL import Image

The next script specifies the path for the Tesseract engine executable file we installed earlier. You need to update your path accordingly if you installed Tesseract OCR in a different location.

pt.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
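
If you’d rather not hard-code the location, one option is to look for the executable on your system PATH first and fall back to the default Windows location only if that fails. The sketch below uses only the standard library; the helper name `find_tesseract` is hypothetical, and the fallback path is simply the one used in this tutorial.

```python
import os
import shutil

# Install location used by the Windows installer in this tutorial.
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(fallback=DEFAULT_WINDOWS_PATH):
    """Return a path to the tesseract executable, or None if it can't be found."""
    on_path = shutil.which("tesseract")  # searches the PATH environment variable
    if on_path:
        return on_path
    return fallback if os.path.isfile(fallback) else None


tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract not found; set tesseract_cmd manually.")
else:
    print("Tesseract found at:", tesseract_path)
    # You could then assign it: pt.pytesseract.tesseract_cmd = tesseract_path
```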


Reading a Car License Plate

For our first example, we’ll read the text from the following foreign car license plate image.

car registration

To read the text from an image, the first step is to open the image. You can do so with the open() function from the Image module of the Pillow library. Next, to actually read the text, pass the image object you just opened to the image_to_string() function of the Pytesseract module. The image_to_string() function converts the image text into a Python string, which you can then use however you want. We’re simply going to print the string to our screen using the print() function. Execute the following script to read the text from the car number plate image.

img_object = Image.open(r"C:/Datasets/car_reg.jpg")
img_text = pt.image_to_string(img_object)
print(img_text)

You’ll need to update the path of the image to match the location of the image you want to convert to string.

From the output below you can see that the numbers and letters on the number plate have been read correctly. However, a leading digit 7 also appears in the output. This is because the “E” at the left of the number plate is wrongly recognized as 7.

Output:

71111 AAA

The output shows that though Tesseract OCR is capable of reading text from an image, it is not 100% correct. An accuracy rate of less than 100% is typical with all OCR engines, so don’t let this discourage you.
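
One thing you can try for short, single-line text such as a number plate is to pass Tesseract a configuration string through the config argument of image_to_string(). The sketch below builds such a string; the helper name `plate_config` is our own. It relies on two Tesseract options: `--psm 7`, which treats the image as a single line of text, and `tessedit_char_whitelist`, which restricts the characters Tesseract may output. Support for the whitelist varies with the Tesseract version and engine mode, and it cannot prevent one allowed character from being mistaken for another, so treat it as something to experiment with rather than a guaranteed fix.

```python
import string


def plate_config(allowed=string.ascii_uppercase + string.digits, psm=7):
    """Build a Tesseract config string for a single line of plate-like text."""
    # --psm 7 asks Tesseract to treat the image as a single text line.
    # tessedit_char_whitelist limits the characters Tesseract may output.
    return f"--psm {psm} -c tessedit_char_whitelist={allowed}"


config = plate_config()
print(config)
# Usage with the objects from the script above:
# img_text = pt.image_to_string(img_object, config=config)
```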

Reading Text from a Simple Image

For our second test, let’s try to read text from the following image.

Text Image

The process is the same as in our last example. You first open the image and then pass the image object to the image_to_string() function of the Pytesseract module, as shown below:

img_object = Image.open(r"C:/Datasets/name.jpg")
img_text = pt.image_to_string(img_object)
print(img_text)

The output below shows that the text “What is your name” has been successfully recognized. However, the pen at the top right of the image is recognized as the character A, which is incorrect. The output again shows that Pytesseract is capable of reading text but is not 100% accurate.

Output:

What is A

your name ?
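
Pytesseract can also report a confidence score for each word through its image_to_data() function, which gives you a way to discard doubtful detections. The sketch below filters a dictionary shaped like the one returned by pt.image_to_data(img_object, output_type=pt.Output.DICT). The helper name `confident_words` and the threshold of 60 are our own assumptions, and we have not checked whether the stray A above would actually fall below that threshold.

```python
def confident_words(data, min_conf=60):
    """Keep words whose confidence is at least `min_conf`.

    `data` is expected to hold parallel lists under the "text" and "conf"
    keys, as in the dictionary output of pytesseract's image_to_data().
    """
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        # conf may be an int, float, or numeric string; -1 marks non-word rows.
        if text.strip() and float(conf) >= min_conf:
            words.append(text)
    return words


# Hand-made sample for illustration only; these are not real OCR results.
sample = {"text": ["", "Hello", "w0rld", " "], "conf": ["-1", "96", "31", "-1"]}
print(confident_words(sample))  # ['Hello']
```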

Reading a Simple Text Receipt

The Pytesseract module returns the best results when reading a black and white image with black text on a white background, like a picture or a scan of a normal piece of printed paper. Let’s try to read text from the image of a receipt that fits this description. Here’s the receipt we’re going to use in this example.

Image of a receipt

Execute the following script to read the text:

img_object = Image.open(r"C:/Datasets/receipt.png")
img_text = pt.image_to_string(img_object)
print(img_text)

As expected, from the output below, you can see that the text has been recognized with 100% accuracy.

Output:

Test Receipt for USB Printer 1

Mar 17, 2018
10:12 PM



Ticket: 01



Item $0,00

Total $0.00
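
Since black text on a white background works best, it can be worth converting other images to black and white before passing them to image_to_string(). Here is a minimal sketch using Pillow; the helper name `to_black_and_white` and the threshold of 128 are assumptions for illustration, and the right threshold will depend on your image.

```python
from PIL import Image


def to_black_and_white(img, threshold=128):
    """Return a grayscale copy of `img` with every pixel forced to 0 or 255."""
    gray = img.convert("L")  # "L" is Pillow's 8-bit grayscale mode
    return gray.point(lambda p: 255 if p >= threshold else 0)


# Tiny synthetic image standing in for a real photo.
demo = Image.new("RGB", (2, 1), (200, 200, 200))
demo.putpixel((0, 0), (30, 30, 30))
bw = to_black_and_white(demo)
print(bw.getpixel((0, 0)), bw.getpixel((1, 0)))  # 0 255
```

You would then pass the returned image to pt.image_to_string() in place of the original image object.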
