Optical Character Recognition (OCR) in Python with Tesseract 4: A tutorial

A tutorial based on hands-on experience with Tesseract 4 in Python for OCR

Author: Sid Metcalfe

Published: November 18, 2023

Introduction

I first dabbled in OCR when I needed to convert a pile of old letters into text. The thought of retyping each one was daunting, until I discovered the power of Tesseract 4. It’s truly amazing how this technology can read text from images, transforming the way we deal with printed materials. In the Python ecosystem, Tesseract can be surprisingly friendly, even for beginners. Today, I want to share insights on using Tesseract for OCR, from quick setups to advanced tweaks, to make your digital life a bit easier.

Introduction to OCR and Tesseract 4

A diagram illustrating the process of OCR converting images of text into machine-encoded text

Optical Character Recognition, or OCR, turns the static characters in an image into editable, searchable text, which opens up a wide range of possibilities for data processing and automation. It’s like taking the silent frames of a picture and giving them a voice. One of the most widely used open-source OCR engines is Tesseract. Tesseract 4 added a recognition engine based on LSTM neural networks, and it is versatile and pretty efficient.

For those exploring OCR, especially in the Python ecosystem, Tesseract 4 can be intimidating. But once you dive into it, you’ll find that it can be quite friendly. Tesseract’s power, combined with Python’s ease of use, offers a compelling solution for OCR tasks.

Imagine you have a scanned document, say a JPEG image, and you’re tasked with extracting all the text from it without typing a single word. That’s where Tesseract kicks in. The first step is to make sure your image is in a format Tesseract can read; PNG and JPEG are the most common.

So, let’s start with a simple example written in Python to see Tesseract in action. First, you’ll need an image with some text on it. For this example, let’s say it’s named ‘sample.jpg’. You want to extract the text from that image. Here’s a basic snippet that uses Tesseract to do just that:

from PIL import Image
import pytesseract

# Let's start by opening an image file
image = Image.open('sample.jpg')

# Now we'll pass this image to Tesseract to do OCR
text = pytesseract.image_to_string(image)

# Finally, let's print the extracted text
print(text)

When you run this piece of code, you’ll see the text from the image printed to your console, a moment of true gratification for a beginner playing with OCR. It might not be perfect, and that’s okay: Tesseract provides plenty of options for improving accuracy, which we’ll get to in the advanced section of this tutorial.

Good documentation matters a lot when you start implementing and tweaking Tesseract settings. The GitHub repository (https://github.com/tesseract-ocr/tesseract) has the details, and you’re bound to rummage through pages of Tesseract’s documentation (https://tesseract-ocr.github.io/tessdoc/). Google Groups forums and Stack Overflow threads often contain nuggets of wisdom from community experience, so don’t hesitate to dive into those when you hit a snag.

Now, Tesseract alone is powerful, but its true potential is unlocked when you preprocess images for better recognition. This can include noise reduction, thresholding, or scaling, and these steps are straightforward with libraries like OpenCV or Pillow (the maintained fork of PIL, the Python Imaging Library). Here’s a nudge in the right direction:

import cv2
import pytesseract
from PIL import Image

# Load the image using OpenCV
image_cv = cv2.imread('sample.jpg')

# Convert the image to gray-scale
gray = cv2.cvtColor(image_cv, cv2.COLOR_BGR2GRAY)

# Apply a threshold to get a binary image
_, thresh_img = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Now, let's run Tesseract on this preprocessed image
text = pytesseract.image_to_string(Image.fromarray(thresh_img))

# Print out the text
print(text)

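In this snippet, before feeding the image to Tesseract, I first convert it to grayscale and apply a binary threshold. The THRESH_OTSU flag tells OpenCV to pick the cut-off value automatically, and the resulting black-and-white image gives the text much stronger contrast against the background.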
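
If the thresholding step feels like a black box, here’s a minimal pure-Python sketch of the idea behind Otsu’s method: try every cut-off from 0 to 255 and keep the one that best separates the dark pixels from the light ones (the largest between-class variance). The names otsu_threshold and binarize are my own, and this is an illustration of the technique, not OpenCV’s actual implementation; in real code you’d stick with cv2.threshold.

```python
def otsu_threshold(pixels):
    """Return the cut-off t (0-255) that maximizes between-class variance.

    Pixels with value <= t form one class, pixels > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_low = 0    # number of pixels with value <= t
    sum_low = 0  # sum of their intensities
    for t in range(256):
        w_low += hist[t]
        sum_low += t * hist[t]
        w_high = total - w_low
        if w_low == 0:
            continue  # nothing in the low class yet
        if w_high == 0:
            break     # everything is in the low class from here on
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        var_between = w_low * w_high * (mean_low - mean_high) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, t):
    """Map each pixel to 255 if it is above the cut-off, else 0."""
    return [255 if p > t else 0 for p in pixels]


# A tiny made-up "image": four dark text pixels and four light background pixels
pixels = [12, 15, 10, 14, 200, 210, 205, 198]
t = otsu_threshold(pixels)
print(t, binarize(pixels, t))
```

The chosen cut-off lands between the dark cluster and the light cluster, which is exactly what you want when separating ink from paper.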

As simple as this process is, remember that deploying OCR isn’t always a walk in the park. You will encounter images that will test your patience, and this is really where the synergy between OCR and the right preprocessing techniques proves pivotal.

Mastering Tesseract involves patience, experimentation, and persistence, and along the way you will amass an arsenal of tools for handling a wide range of OCR challenges. This introduction is just the tip of the iceberg, so brace for the plunge into the depths of this technological marvel. If you’re interested in further expanding your data analysis toolset, consider reading “A short guide on using Dask for data analysis in Python with multiple cores (2023)”.

Setting up the Python Environment for Tesseract

A screenshot of a Python IDE with command lines showing the successful installation of Tesseract and its Python bindings

Setting up a Python environment for Tesseract is a straightforward process, which I’ve streamlined over several projects. Here’s my step-by-step guide to ensure you hit the ground running with Tesseract for OCR in Python.

First things first, you’ll need Python installed on your machine. I’m assuming you’ve got that covered, but if not, head to the official Python download page at https://www.python.org/downloads/ and grab the latest version for your OS. Once Python is set up, I highly recommend using a virtual environment for your Tesseract project. This keeps dependencies neatly bundled and isolated. To create one, open your terminal or command prompt and enter:

python -m venv ocr-env

After creating the virtual environment named “ocr-env,” activate it. On Windows, you can do this with:

ocr-env\Scripts\activate.bat

On macOS and Linux:

source ocr-env/bin/activate

With the environment active, it’s time to install pytesseract, the Python wrapper we will be using to interact with the Tesseract engine. Install it using pip:

pip install pytesseract

However, pytesseract is only a wrapper; you still need Tesseract itself. If you’re on a Mac, you can install Tesseract using Homebrew:

brew install tesseract

Linux users can often install Tesseract using the package manager, for example, on Ubuntu:

sudo apt update
sudo apt install tesseract-ocr

And for Windows, you’ll need to download the installer from the Tesseract at UB Mannheim repository, found here: https://github.com/UB-Mannheim/tesseract/wiki.

Once Tesseract is installed, verify the installation by running:

tesseract --version

If Tesseract is successfully installed, you should see the version number. Next, ensure pytesseract can find the Tesseract executable. If it’s in your PATH, pytesseract will find it automatically, but sometimes you need to set it manually in your code:

import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'  # Update the path to the Tesseract executable if it's different on your machine.
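
If you’d rather not hard-code the path, here’s a small sketch that looks for the executable on your PATH first and only then falls back to a list of likely locations. The find_tesseract helper and the fallback paths are my own assumptions (only the Windows one comes from the snippet above), so adjust them for your machine.

```python
import shutil
from pathlib import Path

# Hypothetical fallback locations; edit these to match your own install.
FALLBACK_PATHS = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    "/usr/local/bin/tesseract",
    "/opt/homebrew/bin/tesseract",
]


def find_tesseract(fallbacks=FALLBACK_PATHS):
    """Return a path to the tesseract executable, or None if none is found."""
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    for candidate in fallbacks:
        if Path(candidate).is_file():
            return candidate
    return None


tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract not found; install it or set tesseract_cmd by hand.")
else:
    print(f"Using Tesseract at {tesseract_path}")
    # With pytesseract installed, you would then set:
    # pytesseract.pytesseract.tesseract_cmd = tesseract_path
```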

Now, let’s install Pillow, a powerful imaging library that makes working with images in Python a breeze. (pip normally pulls it in as a dependency of pytesseract, but installing it explicitly does no harm.)

pip install Pillow

With all dependencies installed, let’s do a quick test to see if everything is wired up correctly. Save an image with some text to your project directory as “test-image.png” and run the following:

from PIL import Image
import pytesseract

# Open an image file
with Image.open("test-image.png") as img:
    text = pytesseract.image_to_string(img)

# Output the text found in the image
print(text)

If you see the text from the image printed out, congratulations – your environment is set up!

This basic setup paves the way for you to dive into OCR with Tesseract in Python. Remember to refer to the Tesseract documentation for additional customization and optimization options to maximize the efficiency and accuracy of your OCR tasks. Happy coding!

Using Tesseract to Recognize Text from Images

A before and after comparison image showing the original picture and the corresponding extracted text

Optical Character Recognition has always struck me as something close to magic. Transforming images into editable text using Tesseract isn’t just a neat trick, it’s a powerful tool for automating data entry, assisting visually impaired users, and digitizing printed documents. Let’s dive into how we can harness this technology using Python.

To extract text from an image, we first need an image! For this walk-through, I’ve chosen a straightforward PNG image with some clear text. The aim is to have Tesseract interpret the text and print it out for us.

Here’s the code snippet to load the image and use Tesseract to do its work:

from PIL import Image
import pytesseract

# Assuming Tesseract is correctly installed and pytesseract python module is installed
# Path to the image we want to extract text from
image_path = 'sample_image.png'

# Open the image with PIL (Python Imaging Library)
image = Image.open(image_path)

# Use pytesseract to do OCR on the image
text = pytesseract.image_to_string(image)

print(text)

When you run this, you should see the text from the image printed out in the console. It’s thrilling to see this happen for the first time!

But sometimes, the image isn’t perfect (welcome to the real world!). Maybe it’s faint, low-contrast, or a little blurry. To improve Tesseract’s accuracy, we can do some image preprocessing. Let’s see how this would work:

from PIL import Image, ImageEnhance, ImageFilter

# `image` is the PIL image opened in the previous snippet

# Convert to grayscale and increase the contrast
image_gray = image.convert('L')
image_enhanced = ImageEnhance.Contrast(image_gray).enhance(2)

# Sharpen while there are still gray levels to work with
image_sharp = image_enhanced.filter(ImageFilter.SHARPEN)

# Convert the image to black and white with a fixed cut-off of 128
image_filtered = image_sharp.point(lambda x: 0 if x < 128 else 255, '1')

# Now let's try Tesseract again with the pre-processed image
text_enhanced = pytesseract.image_to_string(image_filtered)

print(text_enhanced)

This pre-processing often yields better results, especially if the original image was less than ideal.

Now, let’s say we are working with a multi-lingual document, or you’re trying to OCR a language other than English. Tesseract has got you covered; it comes with support for multiple languages. This is how you tell it to look for text in, say, both English and German:

# Let's OCR a bilingual English and German document
text = pytesseract.image_to_string(image, lang='eng+deu')
print(text)

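Remember that Tesseract needs the matching language data (the .traineddata files) installed for every language you ask for. On Ubuntu, for example, German comes from the tesseract-ocr-deu package.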
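
A quick way to catch a missing language pack before Tesseract complains is to compare the codes in your lang string against what is installed. Below is a minimal sketch; missing_languages is a helper name I made up, and the installed list is a stand-in. On a real setup you can get that list from `tesseract --list-langs` on the command line or, in recent pytesseract releases, from pytesseract.get_languages().

```python
def missing_languages(lang, installed):
    """Return the codes from an 'eng+deu'-style string that are not installed."""
    installed_set = set(installed)
    requested = [code for code in lang.split("+") if code]
    return [code for code in requested if code not in installed_set]


# Stand-in for the real list of installed language packs (an assumption)
installed = ["eng", "osd"]

missing = missing_languages("eng+deu", installed)
if missing:
    print(f"Missing language data for: {', '.join(missing)}")
else:
    print("All requested languages are installed.")
```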

Sometimes you’ll find yourself dealing with documents that are just pages and pages of images. Automating the processing of multiple images is our final stop. Here’s a Python script to loop through a directory of images and apply our OCR script:

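import os
from PIL import Image
import pytesseract

# Directory containing images to be OCR'd
image_dir = 'path/to/image_directory'
image_extensions = ('.png', '.jpg', '.jpeg', '.tif', '.tiff', '.bmp')

# Loop through the image files, skipping anything that isn't an image
for image_filename in sorted(os.listdir(image_dir)):
    if not image_filename.lower().endswith(image_extensions):
        continue
    image_path = os.path.join(image_dir, image_filename)
    image = Image.open(image_path)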

    # I'll just use the basic OCR here for brevity
    text = pytesseract.image_to_string(image)
    print(f"Text from {image_filename}:")
    print(text)
    print("-" * 40)

With these examples, I hope you’re feeling empowered to go out there and start experimenting with Tesseract and OCR in your Python projects. It’s a potent combination that handles a surprising variety of real-world tasks. The rewards for diving in and getting your hands digitally dirty are immense—this is just the beginning of what’s possible.

Advanced Techniques and Tips for Tesseract OCR

A flowchart showcasing various preprocessing techniques applied to an image before running it through Tesseract for improved accuracy

After delving into the basics of OCR with Tesseract 4 in Python, it’s time to roll up our sleeves and dive into some advanced techniques. This might seem a bit daunting at first, but with a few tips and a bit of practice, you’ll witness a significant boost in the accuracy and efficiency of your OCR tasks. Let’s get to it.

One of the first things I do when I want to extract text from complex images is preprocessing. Image preprocessing can drastically improve Tesseract’s accuracy. For that, I often employ the PIL (Pillow) or cv2 (OpenCV) libraries. Here’s a quick snippet to convert an image to grayscale and then to binary (black and white) using OpenCV:

import cv2

# Read the image file
image = cv2.imread('example.jpg')

# Convert it to grayscale
gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

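# Convert to a binary image; with THRESH_OTSU the threshold argument (0 here)
# is ignored and OpenCV picks the value itself, returning it as `thresh`
thresh, bw_image = cv2.threshold(gray_image, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)

# Save the preprocessed image if needed, for comparison purposes
# (PNG is lossless, so the hard black/white edges stay crisp)
cv2.imwrite('bw_image.png', bw_image)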

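Noise reduction is another preprocessing step that often helps. Too much noise can make OCR less accurate, so smoothing out an image is important. A simple Gaussian blur does the job, and it usually works best on the grayscale image before thresholding, because blurring an already binarized image just turns its crisp edges gray again:

# Apply Gaussian blur to the grayscale image, then threshold the result
blurred_image = cv2.GaussianBlur(gray_image, (5, 5), 0)
_, bw_image = cv2.threshold(blurred_image, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)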

These preprocessing steps can enhance image quality, thus improving OCR results. Beyond preprocessing, Tesseract itself offers a range of configuration options that tune its behavior. For instance, setting the Page Segmentation Mode (PSM) can have a big impact. Here’s how to pass one:

import pytesseract

custom_config = r'--oem 3 --psm 6'
text = pytesseract.image_to_string(bw_image, config=custom_config)
print(text)

Here, --oem 3 selects the default OCR Engine Mode, which uses whichever engine is available (in Tesseract 4 that is usually the LSTM engine; the combined legacy + LSTM mode is --oem 2), and --psm sets the Page Segmentation Mode. I usually play around with different psm values (0-13), since each one tailors Tesseract’s approach to a particular image structure. For instance, psm 6 assumes a single uniform block of text, which works well for clean images with simple layouts.
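
To make that experimentation less tedious, here’s a rough sketch that runs the same OCR call once per PSM value so you can compare the results side by side. The try_psm_modes helper and the fake_ocr stand-in are hypothetical names of mine, and the mode descriptions are paraphrased from `tesseract --help-psm`. The OCR call is passed in as a function so the sketch runs even without Tesseract installed.

```python
# A few PSM values worth trying first (descriptions paraphrased)
PSM_NOTES = {
    3: "fully automatic page segmentation, no OSD (the default)",
    4: "single column of text of variable sizes",
    6: "single uniform block of text",
    7: "single text line",
    8: "single word",
    11: "sparse text, in no particular order",
}


def try_psm_modes(run_ocr, modes=PSM_NOTES, oem=3):
    """Call run_ocr(config) once per PSM value and collect the outputs."""
    results = {}
    for psm in modes:
        config = f"--oem {oem} --psm {psm}"
        results[psm] = run_ocr(config)
    return results


def fake_ocr(config):
    """Stand-in for a real OCR call; it just echoes the config it was given."""
    return f"<text for {config}>"


for psm, text in try_psm_modes(fake_ocr).items():
    print(f"psm {psm} ({PSM_NOTES[psm]}): {text!r}")
```

With a real image, you’d swap fake_ocr for something like `lambda cfg: pytesseract.image_to_string(bw_image, config=cfg)` and eyeball which mode gives the cleanest text.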

Another tip is to give Tesseract a whitelist of characters through the -c tessedit_char_whitelist parameter. The example below restricts recognition to digits and uppercase letters:

custom_config = r'-c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ --psm 6'
text = pytesseract.image_to_string(bw_image, config=custom_config)
print(text)
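
As far as I know, the LSTM engine has not always honored the whitelist (early 4.0 releases are the usual example), so I like to keep a plain-Python fallback around. The sketch below builds the config string and also filters the OCR output after the fact. Both whitelist_config and keep_allowed are names I invented for illustration.

```python
import string


def whitelist_config(allowed, psm=6):
    """Build a config string that restricts Tesseract to `allowed` characters."""
    if any(ch.isspace() for ch in allowed):
        # Whitespace would split the option when the config string is parsed
        raise ValueError("whitelist must not contain whitespace")
    return f"--psm {psm} -c tessedit_char_whitelist={allowed}"


def keep_allowed(text, allowed):
    """Drop characters outside `allowed`, keeping whitespace so lines survive."""
    allowed_set = set(allowed)
    return "".join(ch for ch in text if ch in allowed_set or ch.isspace())


ALLOWED = string.digits + string.ascii_uppercase

print(whitelist_config(ALLOWED))
print(keep_allowed("AB-12c 9Z!", ALLOWED))  # AB12 9Z
```

The post-filter is blunt, since it deletes stray characters instead of preventing them from being recognized, but it gives you the same character set guarantee either way.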

Sometimes, I train Tesseract with custom fonts or languages. This can be a game-changer for niche applications with specialized documents. Training Tesseract is beyond beginner territory, but it’s worth exploring if you require it. You can find the training tools and documentation in the Tesseract documentation (https://tesseract-ocr.github.io/tessdoc/). However, be prepared for some trial and error; it can be quite a process.

Lastly, let’s talk about evaluating accuracy. One measure I often use is the Levenshtein distance, which counts how many single-character edits (insertions, deletions, or substitutions) are required to change one string into another. You can use the python-Levenshtein package to do this:

import Levenshtein as lev

ground_truth = 'expected text'
# strip() drops the trailing whitespace Tesseract appends to its output
ocr_result = pytesseract.image_to_string(image, config=custom_config).strip()

distance = lev.distance(ground_truth, ocr_result)
print(f"Levenshtein distance: {distance}")

The lower the distance, the closer your OCR result is to the expected text. It’s a handy tool for quick accuracy checks.
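
A raw distance is hard to compare across documents of different lengths, so it helps to divide it by the length of the expected text to get a character error rate. Here’s a self-contained sketch that needs no extra package: edit_distance is a plain dynamic-programming Levenshtein implementation, and character_error_rate is a helper name of my own. Collapsing whitespace before comparing is my choice, not something the metric requires.

```python
def edit_distance(a, b):
    """Levenshtein distance using the classic two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth, ocr_text):
    """Edit distance divided by the length of the expected text."""
    truth = " ".join(ground_truth.split())  # collapse runs of whitespace
    hyp = " ".join(ocr_text.split())
    if not truth:
        return 0.0 if not hyp else 1.0
    return edit_distance(truth, hyp) / len(truth)


# Made-up example: one wrong character plus stray whitespace
cer = character_error_rate("expected text", "expccted  text\n")
print(f"Character error rate: {cer:.1%}")
```

A rate of 0 means a perfect match, and anything above a few percent usually tells me it’s time to revisit the preprocessing or the PSM setting.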

Remember, OCR with Tesseract is part art, part science. Don’t hesitate to try different preprocessing techniques, Tesseract configurations, and to even dive into training if needed. With patience and practice, you’ll start seeing the text emerge more accurately from the images. Keep iterating and learning; there’s always a new trick to uncover in the ever-evolving field of OCR. Happy coding!