Training Tesseract

Lest I forget.

April 23, 2014

Introduction

Tesseract is an open source OCR tool originally developed by HP and now used by Google and others. The source repository is here:

https://github.com/tesseract-ocr

It is primarily a command-line tool. However, there is also a library available for programmatic access, which has been ported to Windows. In addition, there is a .NET wrapper API for it available on GitHub and on NuGet, so installing it in a Visual Studio project is easy:

https://github.com/charlesw/tesseract

Before Tesseract can be used, it must be “trained.” This involves a series of steps that teach it both the font(s) that will be used and the language. There are defaults for many languages and standard fonts (Arial, etc.) on the project site itself, including the default one for English:

https://github.com/tesseract-ocr/tesseract/releases

However, if you want to use a different, “non-standard” font, such as OCR-A, you must train Tesseract for that. This document explains how I trained Tesseract for OCR-A for a work project.

Resources

You will want to become familiar with all of the following. Really. Even though in the end the steps I give should “just work,” understanding what is going on is helpful.

Training Steps

The following describes what to do to create a new traineddata file, which then goes in the tessdata folder that will be under your project’s executable location. In other words, if your project executes out of C:\Program Files\Foo, then the tessdata folder should be at C:\Program Files\Foo\tessdata.

You will have to pick a font (in our example OCR-A) and a three-character language code. The language doesn’t have to be real. In our example we use “zzz” as the language code. When you use Tesseract, you tell it which language to load.
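
For example, once the training below is done, you tell Tesseract which traineddata file to load with the -l flag. In Tesseract 3.x the TESSDATA_PREFIX environment variable points at the directory containing the tessdata folder, so a hypothetical invocation (image.png is a made-up input file) would look like:

    TESSDATA_PREFIX=/path/to/your/project/ tesseract image.png output -l zzz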

All of the following assumes you have access to a Linux box[1] with Python, ImageMagick, Tesseract (including its training tools), the tess_school scripts, and your target font (here, OCR-A) installed.

Step 1 – Convert “truth file” to image

This uses the text2img.py script in tess_school. Per the readme:

-text2img.py: Takes a ground-truth text file and automatically generates image files from the text, for use in training tesseract. Everything is hard coded at the moment, no command-line options yet. Eventually I’d like to have this generate the boxfile too.

First I changed the hardcoded language in the script from ka/kat to en/eng, i.e.:

LANG = "en"
TESS_LANG = "eng"

Note: Even though I ended up using “zzz” as the language, it was easier in the interim to work with “eng” because a lot of samples on the Internet assume it.

I then ran it in the shell as follows:

python text2img.py -f text.txt OCRA

Let’s pull that apart a bit. Besides the script itself, there are two things of interest:

  • text.txt is the input file. It contains the characters we are going to use for training. In this case I made the file a simple one that had all upper and lowercase letters, numbers, and all special characters accessible on a keyboard. Here are the contents:

    ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890`~!@#$%^&*()-_=+[{]}\|;:'",<.>/?
  • OCRA is the name of the font to render the text in for the resulting image. It has to be the official name of a font installed on the system you are running the script on. On Debian/Ubuntu/Mint flavors, fonts are installed under /usr/share/fonts, organized by font type, such as OpenType (.otf) or TrueType (.ttf). In our case OCR-A is a TrueType font, and the files are located at /usr/share/fonts/truetype/ocr-a. That directory contains the following files, and by using OCRA we are telling it to use the main OCR-A font (a quick way to check installed font names is shown after this list):
    • OCRA.ttf
    • OCRABold.ttf
    • OCRACondensed.ttf
    • OCRAItalic.ttf
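
If you are not sure what name a font is installed under, fontconfig’s fc-list tool (available on most desktop Linux systems) lists installed fonts and their family names, so something like this will find it:

    fc-list | grep -i ocr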

This brings up another point: when training for fonts, you have to train separately for each face, i.e., normal vs. bold vs. italic vs. bold italic. In our case we just want the normal face, so that keeps things simple.

The output of the text2img.py run will be a file in the same directory named eng.OCRA.exp0.png. The name combines the language (“eng”) and the font we chose (“OCRA”); the “exp0” suffix exists to allow combining multiple training files together (for example, “exp0” could be for the normal font face, “exp1” for bold, etc.).

Step 2 – Convert the PNG to a TIFF file

While Tesseract can use PNG files for training, apparently it works better with TIFFs, per the tess_school readme:

-png2tif.sh: Uses ImageMagick to convert the PNG output from text2img.py to TIFF files. Tesseract can read PNG files, but sometimes seems to prefer TIFF.

No changes were required. Just run the script:

./png2tif.sh eng.OCRA.exp0.png

You will then have a file in the same directory called eng.OCRA.exp0.tif.
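
If you’d rather skip the script, the conversion is a one-liner with ImageMagick’s convert tool; this is my approximation, so check png2tif.sh for the exact options it passes:

    convert eng.OCRA.exp0.png eng.OCRA.exp0.tif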

Step 3 – Make the “box” file(s)

This step is where the rubber meets the road. It involves letting Tesseract loose on the image we’ve produced and seeing if it can figure out where the characters are and what they are. A “box” file is simply a text file with one line per character, in a simple format:

  • The character Tesseract guessed at a specific location.
  • Four fields giving the coordinates of that character’s bounding box (left, bottom, right, top).
  • A page number (zero-relative), which for us is always zero (training multi-page documents is harder).
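
For example, a line for an uppercase A might look like the following (coordinates are in pixels, measured from the bottom-left corner of the image; the numbers here are made up):

    A 26 62 42 88 0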

The make_boxes.sh script generates the box file(s), looking for any .tif files in the directory. The only thing that needs to be changed is the language code, which again I changed to “eng”:

LANG=eng

Then you just run the script:

./make_boxes.sh

You will then have an eng.OCRA.exp0.box file in the same directory.
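
Under the hood, the script is essentially running Tesseract’s standard box-generation mode on each .tif it finds, along the lines of:

    tesseract eng.OCRA.exp0.tif eng.OCRA.exp0 batch.nochop makebox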

Step 4 – Merge adjacent boxes

Tesseract can sometimes “see” multiple characters where there is in reality only one. This script helps fix that, per the readme:

-merge_boxes.py: Merges nearby boxes in a boxfile resulting from tesseract oversegmenting characters. This is a common error that Tesseract makes and this script will quickly fix most instances of this problem.

I had to make no changes to the script. You run it with:

python merge_boxes.py -d eng.OCRA.exp0.box

The -d parameter makes it a “dry run,” which just reports whether any boxes need merging. In my case none did. See the script for the other parameters to use if the dry run reports boxes that need merging.

Step 5 – Align the box file to the truth file

In a typical Tesseract training session, you would next go through the box file with a box file editor such as Moshpytt, checking and correcting each and every character. On large box files that is a complete PITA, and in my testing large box files (lots of input characters) didn’t seem to significantly increase accuracy. YMMV. Instead, tess_school has a script that takes the truth file used to generate the image and sets each line of the box file to the corresponding character from the truth file. This is handy and very, very time-saving. From the readme:

-align_boxfile.py: Changes a boxfile to match a ground-truth text file. Will abort and complain if the number of boxes doesn’t match the number of characters in the file, so run this only after your boxes are in the right places.

I didn’t have to change anything in the script. To run it:

python align_boxfile.py text.txt eng.OCRA.exp0.box

It is taking in the truth file we used above, text.txt, and updating in-place the box file we generated, eng.OCRA.exp0.box.
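
As a made-up illustration: if Tesseract had misread the digit 0 in the image as the letter O, the corresponding box file line would get its character corrected in place, leaving the coordinates alone:

    Before: O 120 62 136 88 0
    After:  0 120 62 136 88 0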

There are other tools in tess_school, but I found its auto_train.sh script did not generate as good a traineddata file as I got by other means, and I had no use for the remaining scripts at this time, so we will leave tess_school behind now.

Step 6 – Train Tesseract

At this point I wrote a script called trainingtess to finish all the remaining steps in training Tesseract. I won’t go through it in detail (the Resources section above has all the gory details). The script is as follows:

#!/bin/bash
# Run Tesseract in training mode against the image/box pair to produce the .tr file
tesseract zzz.ocra.exp0.tif zzz.ocra.exp0 nobatch box.train
# Extract the set of characters Tesseract can output (the unicharset)
unicharset_extractor zzz.ocra.exp0.box
# Describe the font: name italic bold fixed serif fraktur
echo "ocra 0 0 1 0 0" >font_properties
# Cluster character shapes and generate the trained-data components
shapeclustering -F font_properties -U unicharset zzz.ocra.exp0.tr
mftraining -F font_properties -U unicharset -O zzz.unicharset zzz.ocra.exp0.tr
cntraining zzz.ocra.exp0.tr
# Prefix the outputs with the language code so combine_tessdata picks them up
cp normproto zzz.normproto
cp inttemp zzz.inttemp
cp pffmtable zzz.pffmtable
cp shapetable zzz.shapetable
combine_tessdata zzz.
# Install the result and run a quick smoke test against the training image
cp zzz.traineddata /home/youruserid/tessdata/.
sudo cp zzz.traineddata /usr/share/tesseract-ocr/tessdata/.
tesseract zzz.ocra.exp0.tif output -l zzz

You have to make the following changes:

  • Rename the “eng” files that came out of the tess_school work to “zzz”, or change the trainingtess script to use “eng” itself. Your choice. And for your project you may want some other “language” like “foo” or “bar” instead anyway. I also changed the font name in the files from “OCRA” to “ocra” to match Tesseract “standards.”
  • Change the font from “ocra” to the appropriate font name for your use.
  • Change the line that creates the font_properties file appropriately (see the example after this list). Its format is:
    • Font name – as used in the file names, here ocra.
    • Italic – 1 if training for an italic font; 0 in our example.
    • Bold – 1 if training for a bold font; 0 in our example.
    • Fixed – 1 if training on a fixed-width (monospaced) font; 1 in our example, since OCR-A is fixed-width.
    • Serif – 1 if the font has serifs; 0 in our example.
    • Fraktur – 1 if the font is a “Fraktur” font (aka “blackletter” or “Olde English/Gothic” font); 0 in our example.
  • You will also want to change where it copies the output traineddata file.
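
To illustrate the font_properties format: if you were instead training the bold italic face of a hypothetical serif font named times, the line would be:

    times 1 1 0 1 0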

For input, it will need the .tif and .box files generated by the tess_school scripts to be in the same directory as trainingtess. You then simply run it:

    ./trainingtess

You will get a lot of output files from it, including:

font_properties
inttemp
normproto
output.txt
pffmtable
shapetable
unicharset
zzz.inttemp
zzz.normproto
zzz.ocra.exp0.box
zzz.ocra.exp0.tif
zzz.ocra.exp0.tr
zzz.ocra.exp0.txt
zzz.pffmtable
zzz.shapetable
zzz.traineddata
zzz.unicharset

Out of all those, and out of all this work, the one we’re interested in is the zzz.traineddata file. That is what goes in your project’s tessdata directory. The other interesting file is output.txt, which shows the output from the last step in the script: running Tesseract with the new traineddata file against the .tif file and writing out the characters it found. If you did everything right, those should be the same characters that are in the image file. If so, you have been successful! Good job, citizen!
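
A quick way to check is to diff the OCR output against the truth file, stripping newlines first so a trailing line break doesn’t count as a difference; no output from diff means they match:

    diff <(tr -d '\n' < text.txt) <(tr -d '\n' < output.txt)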

Other Issues

I did not have to train Tesseract for “words,” which is where the whole language side really comes into play, with dictionaries and files to help disambiguate similar characters, especially when considering kerning issues, e.g., distinguishing “rn” from “m” in a sans serif font. That is more useful when you are trying to OCR entire documents into English, for example.

The Tesseract language files that are on the project site are already pre-trained for some common (mostly sans serif) fonts. If you are trying to do real language processing I would start with those files and hope they “just work.” If not, you have a lot of work ahead of you. At that point I would start considering a commercial package.


  [1] It could possibly work on Cygwin, but I didn’t try it.