Tesseract is an open source OCR tool originally developed by HP and now used by Google and others. The source repository is here:
It is primarily a command-line tool. However, a library is available for programmatic access, and it has been ported to Windows. In addition, there is a .NET wrapper API for it available on GitHub and on NuGet, so installing it in a Visual Studio project is easy:
Before Tesseract can be used, it must be “trained.” This involves a series of steps that teach it both the font(s) that will be used as well as the language. There are defaults for many languages and standard fonts (Arial, etc.) on the project site itself, including the default one for English:
However, if you want to use a different, “non-standard” font, such as OCR-A, you must train Tesseract for that. This document explains how I trained Tesseract for OCR-A for a work project.
You will want to become familiar with all of the following. Really. Even though in the end the steps I give should “just work,” understanding what is going on is helpful.
- The official Tesseract training guide, which is very detailed - https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
- An easier-to-follow write-up - http://www.michaeljaylissner.com/blog/adding-new-fonts-to-tesseract-3-ocr-engine
- The “official” Tesseract training data, useful for general purpose OCR, but I didn’t end up using it - https://github.com/tesseract-ocr/tessdata/releases
- A problem ticket that shows the kinds of problems that come up. The comment thread actually ended up helping me out – https://code.google.com/p/tesseract-ocr/issues/detail?id=629
- A set of Python scripts I ended up using, but only partially because they didn’t all work for me – https://github.com/ddohler/tess_school
- You may end up needing a “box editor,” although the steps below don’t use it. Here is one I tried written in Python – https://github.com/ddohler/moshpytt
The following describes how to create a new traineddata file, which then goes in the tessdata folder under your project’s executable location. In other words, if your project executes out of C:\Program Files\Foo, then the tessdata folder should sit directly inside that folder.
You will have to pick a font (in our example OCR-A) and a three-character language code. The language doesn’t have to be real. In our example we use “zzz” as the language code. When you use Tesseract, you tell it which language to load.
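As a rough illustration of the folder layout described above, here is a minimal sketch that computes where a given language's traineddata file would be expected. The function name `traineddata_path` is made up for this example, and it is written for Python 3 rather than the Python 2 that the training scripts below use.

```python
from pathlib import PureWindowsPath


def traineddata_path(executable_dir, lang):
    """Return the expected location of <lang>.traineddata.

    Assumes the layout described in the text: a "tessdata" folder sitting
    directly under the folder the project executes from.
    """
    return PureWindowsPath(executable_dir) / "tessdata" / (lang + ".traineddata")


# "zzz" is the made-up language code used throughout this write-up.
print(traineddata_path(r"C:\Program Files\Foo", "zzz"))
```

PureWindowsPath is used so the sketch behaves the same way on the Linux box where the training happens.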
All of the following assumes you have access to a Linux box [1] with the following installed:
- Python 2
- Pango and Cairo (installed with Python)
Step 1 – Convert “truth file” to image
This uses the text2img.py script in tess_school. Per the readme:
-text2img.py: Takes a ground-truth text file and automatically generates image files from the text, for use in training tesseract. Everything is hard coded at the moment, no command-line options yet. Eventually I’d like to have this generate the boxfile too.
First I changed the hardcoded language in the script from ka/kat to en/eng, i.e.:
LANG = "en"
TESS_LANG = "eng"
Note: Even though I ended up using “zzz” as the language, it was easier in the interim to work with “eng” because a lot of samples on the Internet assume it.
I then ran it in the shell as follows:
python text2img.py -f text.txt OCRA
Let’s pull that apart a bit. Besides the script itself, there are two things of interest:
- text.txt is the input file. It contains the characters we are going to use for training. In this case I made the file a simple one that had all upper and lowercase letters, numbers, and all special characters accessible on a keyboard. Here are the contents:
- OCRA is the name of the font to output the text as in the resulting image. It has to be the official name of a font installed on the system you are running the script on. On Debian/Ubuntu/Mint flavors, fonts are installed under /usr/share/fonts, and then you have to know what type of font it is, such as OpenType (.otf) or TrueType (.ttf). In our case OCR-A is a TrueType font, and the files are located at /usr/share/fonts/truetype/ocr-a. In that directory are the following files, and by using OCRA we are telling it to use the main OCR-A font:
This brings up another point. When training for fonts, you have to train the normal, bold, italic, and bold-italic faces each separately. In our case we just want the normal font, so that keeps things simple.
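A truth file along the lines described above (upper and lowercase letters, digits, and keyboard special characters) could be generated with a sketch like the following. The one-group-per-line layout is an assumption, not necessarily how my own text.txt was arranged, and the sketch is written for Python 3.

```python
import string


def build_truth_text():
    """Return one line each of uppercase, lowercase, digits, and punctuation.

    The line layout is an assumption made for this sketch.
    """
    lines = [
        string.ascii_uppercase,
        string.ascii_lowercase,
        string.digits,
        string.punctuation,  # the 32 ASCII special characters
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write the truth file that text2img.py takes as its input.
    with open("text.txt", "w") as handle:
        handle.write(build_truth_text())
```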
The output of the text2img.py execution will be a file in the same directory named eng.OCRA.exp0.png. The name breaks down as the language (“eng”), the font we chose (“OCRA”), and an “exp0” suffix, which allows combining multiple training files together (for example, you could have “exp0” be for the normal font face, “exp1” be for bold, etc.).
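To make the naming convention concrete, here is a small sketch that splits a training file name into its parts. The `TrainingName` tuple and `parse_training_name` function are names invented for this example (Python 3).

```python
from collections import namedtuple

# Hypothetical container for the pieces of a training file name.
TrainingName = namedtuple("TrainingName", "lang font exp ext")


def parse_training_name(filename):
    """Split '<lang>.<font>.exp<N>.<ext>' into its four parts."""
    parts = filename.split(".")
    if len(parts) != 4 or not parts[2].startswith("exp") or not parts[2][3:].isdigit():
        raise ValueError("not a <lang>.<font>.exp<N>.<ext> name: %r" % filename)
    lang, font, exp, ext = parts
    return TrainingName(lang, font, int(exp[3:]), ext)


print(parse_training_name("eng.OCRA.exp0.png"))
# TrainingName(lang='eng', font='OCRA', exp=0, ext='png')
```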
Step 2 – Convert the PNG to a TIFF file
While Tesseract can use PNG files for training, apparently it works better with TIFFs, per the readme:
-png2tif.sh: Uses ImageMagick to convert the PNG output from text2img.py to TIFF files. Tesseract can read PNG files, but sometimes seems to prefer TIFF.
No changes were required. Just run the script:
You will then have a file in the same directory called
Step 3 – Make the “box” file(s)
This step is where the rubber meets the road. It involves letting Tesseract loose on the image we’ve produced and seeing whether it can figure out where the characters are and what they are. A “box” file is simply a text file with a simple format, one character per line:
- The character Tesseract guessed in a specific location.
- Four fields that are the coordinates of the character.
- A page number (zero-relative), which for us is always zero (training multi-page docs is harder).
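The format above is easy to read programmatically. The sketch below parses one box file line. It assumes the four coordinates appear in the order left, bottom, right, top, which is how Tesseract 3 box files are normally laid out, with the origin at the bottom-left of the image. The `Box` tuple, the function name, and the sample line are invented for illustration (Python 3).

```python
from collections import namedtuple

# Hypothetical record for one line of a box file.
Box = namedtuple("Box", "char left bottom right top page")


def parse_box_line(line):
    """Parse '<char> <left> <bottom> <right> <top> <page>' into a Box."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got %d: %r" % (len(fields), line))
    char = fields[0]
    left, bottom, right, top, page = (int(value) for value in fields[1:])
    return Box(char, left, bottom, right, top, page)


# A made-up sample line, not taken from a real box file.
print(parse_box_line("A 10 20 30 45 0"))
# Box(char='A', left=10, bottom=20, right=30, top=45, page=0)
```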
The make_boxes.sh script generates the box file(s), looking for any .tif files in the directory. The only thing that needs to be changed is the language code, which again I changed to “eng”:
Then you just run the script:
You will then have an eng.OCRA.exp0.box file in the same directory.
Step 4 – Merge adjacent boxes
Tesseract can sometimes “see” multiple characters where there is in reality only one. This script helps fix that, per the readme:
-merge_boxes.py: Merges nearby boxes in a boxfile resulting from tesseract oversegmenting characters. This is a common error that Tesseract makes and this script will quickly fix most instances of this problem.
I had to make no changes to the script. You run it with:
python merge_boxes.py -d eng.OCRA.exp0.box
The -d parameter indicates it’s a “dry run” and will just indicate if there were any boxes that needed merging. In my case there weren’t. See the script for other parameters in case the dry run indicates there might be boxes needing merging.
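To give a feel for what merging means, here is a minimal sketch of the idea. It is not the algorithm merge_boxes.py actually uses, which I have not reproduced here. It assumes boxes are (char, left, bottom, right, top, page) tuples in reading order, and the `max_gap` threshold is an invented parameter (Python 3).

```python
def boxes_touch(prev, box, max_gap):
    """True if two boxes share a page and text line and are nearly adjacent."""
    same_page = prev[5] == box[5]
    # Boxes on different text lines do not overlap vertically.
    vertical_overlap = prev[2] < box[4] and box[2] < prev[4]
    horizontal_gap = box[1] - prev[3]
    return same_page and vertical_overlap and horizontal_gap <= max_gap


def merge_nearby(boxes, max_gap=1):
    """Merge each box into its predecessor when the two nearly touch."""
    merged = []
    for box in boxes:
        if merged and boxes_touch(merged[-1], box, max_gap):
            prev = merged[-1]
            # Keep the first guess as the character; take the union rectangle.
            merged[-1] = (prev[0], min(prev[1], box[1]), min(prev[2], box[2]),
                          max(prev[3], box[3]), max(prev[4], box[4]), prev[5])
        else:
            merged.append(box)
    return merged


# Made-up data: an "m" that was oversegmented into "r" and "n", then an "A".
sample = [("r", 10, 20, 14, 40, 0), ("n", 15, 20, 22, 40, 0), ("A", 40, 20, 60, 45, 0)]
print(merge_nearby(sample))
```

The merged box keeps the first guessed character, on the assumption that the character labels get corrected in a later step anyway.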
Step 5 – Align the box file to the truth file
In a typical Tesseract training, you would then go through the box file with a box file editor such as Moshpytt, checking and correcting each and every character. On large box files that is a complete PITA, and from my testing large box files (lots of input characters) didn’t seem to significantly increase the accuracy. YMMV. Instead, tess_school has a script that automatically takes in the truth file used to generate the image, and makes sure every corresponding line in the box file is set to the correct character from the truth file. This is handy and very, very time-saving. From the readme:
-align_boxfile.py: Changes a boxfile to match a ground-truth text file. Will abort and complain if the number of boxes doesn’t match the number of characters in the file, so run this only after your boxes are in the right places.
I didn’t have to change anything in the script. To run it:
python align_boxfile.py text.txt eng.OCRA.exp0.box
It takes in the truth file we used above, text.txt, and updates the box file we generated in place.
There are other tools in tess_school, but I found that its auto_train.sh script did not generate as good a traineddata file as I got via other means, and I had no use for the other scripts at this time, so we will leave tess_school behind now.
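For reference, the alignment idea from Step 5 can be sketched in a few lines. This is not the code of align_boxfile.py itself. It assumes whitespace in the truth file produces no boxes, and the function name is invented (Python 3).

```python
def align_boxes(truth_text, box_lines):
    """Replace the guessed character on each box line with the truth character.

    Aborts if the number of boxes differs from the number of
    non-whitespace characters in the truth text.
    """
    truth_chars = [char for char in truth_text if not char.isspace()]
    if len(truth_chars) != len(box_lines):
        raise ValueError("%d truth characters but %d boxes"
                         % (len(truth_chars), len(box_lines)))
    aligned = []
    for char, line in zip(truth_chars, box_lines):
        coords = line.split()[1:]  # drop Tesseract's guess, keep the coordinates
        aligned.append(" ".join([char] + coords))
    return aligned


# Made-up box lines where Tesseract guessed "4" and "8" for "A" and "B".
print(align_boxes("AB\n", ["4 1 2 3 4 0", "8 5 6 7 8 0"]))
# ['A 1 2 3 4 0', 'B 5 6 7 8 0']
```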
Step 6 – Training Tesseract
At this point I wrote a script called trainingtess to finish all the remaining steps in training Tesseract. I won’t go through it in detail (the Resources section above has all the gory details). The script is as follows:
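#!/bin/bash
# Generate the feature file (zzz.ocra.exp0.tr) from the image and its box file
tesseract zzz.ocra.exp0.tif zzz.ocra.exp0 nobatch box.train
# Extract the character set into the "unicharset" file
unicharset_extractor zzz.ocra.exp0.box
# Font properties: name italic bold fixed serif fraktur
echo "ocra 0 0 1 0 0" >font_properties
# Cluster the features (shapetable, inttemp, pffmtable, normproto)
shapeclustering -F font_properties -U unicharset zzz.ocra.exp0.tr
mftraining -F font_properties -U unicharset -O zzz.unicharset zzz.ocra.exp0.tr
cntraining zzz.ocra.exp0.tr
# Prefix the outputs with the language code so combine_tessdata finds them
cp normproto zzz.normproto
cp inttemp zzz.inttemp
cp pffmtable zzz.pffmtable
cp shapetable zzz.shapetable
# Bundle everything into zzz.traineddata
combine_tessdata zzz.
# Copy the result where it is needed
cp zzz.traineddata /home/youruserid/tessdata/.
sudo cp zzz.traineddata /usr/share/tesseract-ocr/tessdata/.
# Test: OCR the training image with the new language, writing output.txt
tesseract zzz.ocra.exp0.tif output -l zzz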
You have to make the following changes:
- Rename the “eng” files that came out of the tess_school work to “zzz”, or change the trainingtess script to use “eng” itself. Your choice. And for your project you may want some other “language” like “foo” or “bar” instead anyway. I also changed the font name in the files from “OCRA” to “ocra” to match Tesseract “standards.”
- Change the font from “ocra” to the appropriate font name for your uses.
- Change the line that creates the font_properties file appropriately. Its format is:
  - Font name – as used in the file names.
  - Italic – 1 if training for italic font. 0 in our example.
  - Bold – 1 if training for bold font. 0 in our example.
  - Fixed – 1 if training on a fixed (monospaced) font. 1 in our example, since OCR-A is a fixed font.
  - Serif – 1 if the font has serifs. 0 in our example.
  - Fraktur – 1 if the font is a “Fraktur” font (aka “blackletter” or “Olde English/Gothic” font). 0 in our example.
- You will also want to change where it copies the output
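Two of the changes in the list above are mechanical enough to sketch in code: renaming the tess_school output files to the new language code with a lowercase font name, and building the font_properties line from the five flags. Both function names are invented for this example (Python 3).

```python
def rename_training_file(filename, new_lang):
    """Map e.g. 'eng.OCRA.exp0.box' to '<new_lang>.ocra.exp0.box'."""
    _old_lang, font, rest = filename.split(".", 2)
    return ".".join([new_lang, font.lower(), rest])


def font_properties_line(font, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Build '<font> <italic> <bold> <fixed> <serif> <fraktur>' with 0/1 flags."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([font] + [str(int(flag)) for flag in flags])


print(rename_training_file("eng.OCRA.exp0.box", "zzz"))  # zzz.ocra.exp0.box
print(font_properties_line("ocra", fixed=True))          # ocra 0 0 1 0 0
```

The second line printed matches what the echo command in the trainingtess script writes for OCR-A.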
For input, it will need the .box files generated by the tess_school scripts, along with the matching .tif images that the script’s first command reads, to be in the same directory as trainingtess. You then simply run it:
You will get a lot of output files from it, including:
font_properties
inttemp
normproto
output.txt
pffmtable
shapetable
unicharset
zzz.inttemp
zzz.normproto
zzz.ocra.exp0.box
zzz.ocra.exp0.tif
zzz.ocra.exp0.tr
zzz.ocra.exp0.txt
zzz.pffmtable
zzz.shapetable
zzz.traineddata
zzz.unicharset
Out of all those, and out of all this work, the one we’re interested in is the zzz.traineddata file. That is what will go in the tessdata directory of your project. The other interesting file is output.txt, because it shows the output from the last step in the script, which ran Tesseract with the new traineddata file on the .tif file and had it OCR the image and output what characters it found. If you did everything right, it should be the same characters that are in the image file. If so, you have been successful! Good job, citizen!
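Rather than eyeballing output.txt, the comparison against the truth file can be automated. The sketch below ignores whitespace, since the OCR output may not reproduce spacing and line breaks exactly. The function name is invented, and the sample strings are made up rather than real Tesseract output (Python 3).

```python
def ocr_mismatches(truth_text, ocr_text):
    """Compare two texts character by character, ignoring whitespace.

    Returns (mismatches, length_difference), where mismatches is a list of
    (index, expected, found) tuples and length_difference is the number of
    truth characters minus the number of OCR characters.
    """
    truth = [char for char in truth_text if not char.isspace()]
    found = [char for char in ocr_text if not char.isspace()]
    mismatches = [(index, expected, actual)
                  for index, (expected, actual) in enumerate(zip(truth, found))
                  if expected != actual]
    return mismatches, len(truth) - len(found)


# Made-up example: the OCR pass read the digit zero as a capital O.
print(ocr_mismatches("ABC\n012\n", "ABC\nO12\n"))
# ([(3, '0', 'O')], 0)
```

In practice you would read text.txt and output.txt from disk and pass their contents in. An empty list and a difference of zero mean the new traineddata file reproduced the training image exactly.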
I did not have to train Tesseract for “words,” which is where the whole language thing really comes into play, with dictionaries and files to help “unambiguate” similar characters, especially when considering kerning issues, e.g., distinguishing “rn” from “m” in a sans serif font. That is more useful when you are trying to OCR entire documents into English, for example.
The Tesseract language files that are on the project site are already pre-trained for some common (mostly sans serif) fonts. If you are trying to do real language processing I would start with those files and hope they “just work.” If not, you have a lot of work ahead of you. At that point I would start considering a commercial package.
[1] It could possibly work on Cygwin, but I didn’t try it.