RESOURCE: Using Kraken to Train Your Own OCR Models

dh+lib 2019-11-14

Christine Roughan, a PhD student at NYU, has created a guide on how to train and implement OCR models using Kraken. Kraken is open-source command-line software for performing OCR; it offers both pre-trained OCR models and the ability to produce artificial training data from a text provided by the user.

This guide is a basic walkthrough of downloading and running Kraken, preparing and generating artificial training data, training and fine-tuning your model, and performing OCR on your text(s). The author uses an Arabic text as an example, but the guide’s steps are reproducible with any language. Note that the walkthrough does not cover initial preparation of the images to be processed, so if you are starting from a PDF, the pages must first be separated into individual image files using a tool like pdftoppm or ImageMagick’s convert tool. The author notes that she has been able to use Kraken with PNG, TIFF, and JPG files.
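Since the guide leaves the page-splitting step to the reader, here is a minimal sketch of one way to do it with pdftoppm from Python. It is not taken from the guide: the function names, the "page" output prefix, and the 300 DPI default are illustrative assumptions, and the right resolution will depend on your scans.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftoppm_command(pdf_path, out_prefix, dpi=300):
    """Return the argument list for splitting a PDF into one PNG per page.

    The 300 DPI default is an assumption, not a value from the guide.
    """
    # -png selects PNG output; -r sets the rendering resolution in DPI.
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(out_prefix)]


def split_pdf(pdf_path, out_dir, dpi=300):
    """Run pdftoppm and return the generated page images, sorted by name."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm was not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # "page" is a hypothetical prefix; pdftoppm appends -<page number>.png.
    prefix = out_dir / "page"
    subprocess.run(build_pdftoppm_command(pdf_path, prefix, dpi), check=True)
    return sorted(out_dir.glob("page-*.png"))
```

Calling split_pdf("book.pdf", "pages") would, under these assumptions, leave one PNG per page in the pages directory, ready to be passed to Kraken.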

This resource is a helpful introduction to using Kraken for performing OCR and creating your own training data. It will be of particular interest to anyone working with languages written in non-Roman scripts, or who would like to train and implement their own OCR models rather than relying on the pre-made models that come packaged with OCR software.