Overview

By default, the Tesseract HOCR plugin is part of the page process module.

This plugin reads the image files listed in a batch's batch.xml, generates an HOCR file for each of them, and updates batch.xml accordingly.

Configuration

Configurable Properties

The following properties of the Tesseract HOCR plugin can be configured from the UI:

Configurable property | Type of value | Value options | Description
Tesseract Switch | List of values | ON, OFF | Turns this plugin ON or OFF. If the switch is OFF, the plugin does nothing.
Tesseract color switch | List of values | ON, OFF | Tesseract is unable to read colored TIFFs. When the color switch is ON, PNGs are sent for OCR instead. Switch it ON for batch classes that are expected to contain colored TIFF images.
Tesseract Language | String | NA | The language to use for OCR. At present, Tesseract supports only a single language per image file. E.g.: specify ‘eng’ for English, ‘tur’ for Turkish.
Tesseract Version | String | NA | The Tesseract version installed on the system. E.g.: specify ‘tesseract_version_3’ for Tesseract 3.0, ‘tesseract_version_2’ for Tesseract 2.0.
Tesseract Valid Extensions | List of values | tif, gif, png | The file extensions of the input images that the plugin accepts.
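
To make the Tesseract Language and Tesseract Valid Extensions properties more concrete, the sketch below builds a Tesseract command line for a single image. It assumes the standard Tesseract 3 command-line form (`tesseract <image> <outputbase> -l <lang> hocr`); how the plugin itself invokes Tesseract is not described on this page, and the helper name `build_tesseract_command` and its parameters are illustrative only.

```python
from pathlib import Path

# Extensions taken from the "Tesseract Valid Extensions" property above.
VALID_EXTENSIONS = ("tif", "gif", "png")


def build_tesseract_command(image_path, output_base, language="eng",
                            valid_extensions=VALID_EXTENSIONS):
    """Return the argument list for one Tesseract HOCR run (hypothetical helper)."""
    extension = Path(image_path).suffix.lstrip(".").lower()
    if extension not in valid_extensions:
        raise ValueError(
            f"Extension '{extension}' is not in the valid extensions list"
        )
    # "hocr" is the Tesseract config name that switches the output to HOCR.
    return ["tesseract", str(image_path), str(output_base), "-l", language, "hocr"]
```

The returned list can be passed to `subprocess.run`; a single language code such as ‘eng’ or ‘tur’ is used per image, in line with the Tesseract Language property.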

Steps of execution

    • This plug-in runs in the page process phase of the application, after all import processing on the batch is complete and the batch is ready for page processing.
    • The plug-in performs OCR on all the input images.
    • When all the work is done, it writes the name of each HOCR file to batch.xml and generates the HOCR output as HTML and HOCR.xml files.
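
The steps above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the element names `Page`, `OCRInputFileName` and `HocrFileName` are assumptions about the batch.xml layout, and `run_ocr` is a hypothetical callback standing in for the real OCR call.

```python
import xml.etree.ElementTree as ET


def add_hocr_entries(batch_xml_text, run_ocr):
    """Run OCR for every page and record the HOCR file name in the batch XML.

    The <Page>, <OCRInputFileName> and <HocrFileName> element names are
    assumptions made for this sketch, not a documented schema.
    """
    root = ET.fromstring(batch_xml_text)
    for page in root.iter("Page"):
        image_name = page.findtext("OCRInputFileName")
        if not image_name:
            continue  # nothing to OCR for this page
        hocr_name = run_ocr(image_name)  # returns the generated HOCR file name
        entry = page.find("HocrFileName")
        if entry is None:
            entry = ET.SubElement(page, "HocrFileName")
        entry.text = hocr_name
    return ET.tostring(root, encoding="unicode")
```

Passing the OCR step in as a function keeps the XML bookkeeping separate from the Tesseract call, so the update logic can be exercised without Tesseract installed.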

Dependency

This plugin requires only an image as input: a PNG if the color switch is ON, or a TIFF if the color switch is OFF. Hence, one of the following plugins must run before it: ‘Create OCR Input Plugin’ or ‘Create Display Image Plugin’.
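
The input rule described above (PNG when the color switch is ON, TIFF when it is OFF) can be expressed as a small helper. The function name `select_ocr_input` and its parameters are hypothetical and serve only to illustrate the rule.

```python
def select_ocr_input(color_switch, png_name, tiff_name):
    """Pick the image sent to Tesseract based on the color switch value."""
    switch = color_switch.strip().upper()
    if switch not in ("ON", "OFF"):
        raise ValueError(f"Color switch must be ON or OFF, got '{color_switch}'")
    # ON: colored images are expected, so the PNG is used instead of the TIFF.
    return png_name if switch == "ON" else tiff_name
```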

Troubleshooting

The following are a few common error messages raised when the plugin malfunctions:

S no. | Error message | Possible root cause
1 | Tesseract Base path not configured. | The environment variable for Tesseract is either not set, or its path is configured incorrectly.
2 | Space found in the name of image: xyz.png. So it cannot be processed | The image name contains spaces. Remove the spaces from the image name and restart the batch from the page process module.
3 | No valid extensions are specified in resources | Valid extensions for input image files are not specified.
4 | Image Processing or XML updation failed for image: xyz | The input image file has an extension other than those specified in the property ‘Tesseract Valid Extensions’.
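
The root causes listed above can be checked before a batch is run. The sketch below is one possible pre-flight check: the error strings are copied from the table, but the checks themselves, the function name `preflight_errors`, and the environment variable name `TESSERACT_PATH` are assumptions, since this page does not name the variable or describe how the plugin detects each condition.

```python
import os
from pathlib import Path


def preflight_errors(image_name, valid_extensions, env=None,
                     base_path_var="TESSERACT_PATH"):
    """Return the error messages that would likely apply to one input image.

    base_path_var is a hypothetical variable name used only for illustration.
    """
    env = os.environ if env is None else env
    errors = []
    if not env.get(base_path_var):
        errors.append("Tesseract Base path not configured.")
    if not valid_extensions:
        errors.append("No valid extensions are specified in resources")
    if " " in image_name:
        errors.append(
            f"Space found in the name of image: {image_name}. "
            "So it cannot be processed"
        )
    extension = Path(image_name).suffix.lstrip(".").lower()
    allowed = [ext.lower() for ext in valid_extensions]
    if allowed and extension not in allowed:
        errors.append(
            f"Image Processing or XML updation failed for image: {image_name}"
        )
    return errors
```

An empty list means none of the four conditions was detected for that image.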

 

