Overview

By default, the Tesseract HOCR plugin runs as part of page processing.

This plugin reads the image files listed in a batch's batch.xml, generates an HOCR file for each of them, and updates batch.xml accordingly.

Configuration

Configurable Properties

The following properties of the Tesseract HOCR plugin can be configured from the UI:

[Screenshot: Batch Class Management, Tesseract HOCR plugin configuration]

Configurable property | Type of value | Value options | Description
Tesseract Switch | List of values | ON, OFF | Turns the plugin ON or OFF. When the switch is OFF, the plugin does nothing.
Tesseract Color Switch | List of values | ON, OFF | Tesseract is unable to read colored TIFFs. When the color switch is ON, the plugin sends PNGs for OCR instead. Switch it ON for batch classes that are expected to contain colored TIFF images.
Tesseract Language | String | NA | The language to use for OCR. At present, Tesseract supports only a single language per image file. E.g., specify ‘eng’ for English or ‘tur’ for Turkish.
Tesseract Version | String | NA | The Tesseract version installed on the system. E.g., specify ‘tesseract_version_3’ for Tesseract 3.0 or ‘tesseract_version_2’ for Tesseract 2.0.
Tesseract Valid Extensions | Multi-select | tif, gif, png | The image file extensions that the plugin accepts as input.
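As a rough illustration of how the language property could end up in a Tesseract call, the sketch below builds a command line for a single image. The `tesseract <image> <output base> -l <language> hocr` form is the usual Tesseract 3.x command-line usage; the function name `build_command` and the file name are hypothetical, and the plugin's actual invocation may differ.

```python
from pathlib import Path
from typing import List


def build_command(image: str, language: str = "eng") -> List[str]:
    """Build a Tesseract command that requests hOCR output for one image.

    Hypothetical helper for illustration only; not the plugin's own code.
    """
    # Tesseract takes an output *base* name and appends the extension itself.
    output_base = str(Path(image).with_suffix(""))
    return ["tesseract", image, output_base, "-l", language, "hocr"]


if __name__ == "__main__":
    print(build_command("page_0001.tif", "eng"))
```

The command is returned as a list so that it can be passed to `subprocess.run` without shell quoting concerns.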

Steps of execution

  • This plugin runs in the page process phase of the application, after all import processing on the batch is complete and the batch is ready for page processing.
  • The plugin performs OCR on all the input images.
  • When OCR is complete, it writes the name of each HOCR file to batch.xml and generates the HOCR output as HTML and HOCR.xml files.
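The last step above, recording each HOCR file name in batch.xml, can be sketched roughly as follows. The element names (`Page`, `OCRInputFileName`, `HocrFileName`), the `_HOCR.xml` naming pattern, and the sample document are assumptions made for illustration; the real batch.xml schema should be checked before relying on any of them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified batch.xml content.
SAMPLE_BATCH_XML = (
    "<Batch><Pages>"
    "<Page><Identifier>PG0</Identifier>"
    "<OCRInputFileName>a_0.png</OCRInputFileName></Page>"
    "<Page><Identifier>PG1</Identifier>"
    "<OCRInputFileName>a_1.png</OCRInputFileName></Page>"
    "</Pages></Batch>"
)


def record_hocr_files(batch_xml: str) -> str:
    """Return batch XML with an HOCR file name recorded for every page."""
    root = ET.fromstring(batch_xml)
    for page in root.iter("Page"):
        image = page.findtext("OCRInputFileName")
        if not image:
            continue  # nothing to OCR for this page
        hocr = page.find("HocrFileName")
        if hocr is None:
            hocr = ET.SubElement(page, "HocrFileName")
        # Assumed naming pattern: <image base name>_HOCR.xml
        hocr.text = image.rsplit(".", 1)[0] + "_HOCR.xml"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(record_hocr_files(SAMPLE_BATCH_XML))
```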

Dependency

This plugin requires only an image as input: a PNG if the color switch is ON, or a TIFF if the color switch is OFF. Hence one of the plugins ‘Create OCR Input Plugin’ or ‘Create Display Image Plugin’ must run before it.
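The relationship between the color switch and the expected input format can be expressed as a small lookup. This is only a sketch of the rule stated above; the function name `expected_input_extension` is hypothetical.

```python
def expected_input_extension(color_switch: str) -> str:
    """Map the color switch value to the input image extension.

    ON  -> PNG input is sent for OCR (colored images).
    OFF -> TIFF input is sent for OCR.
    """
    switch = color_switch.strip().upper()
    if switch not in {"ON", "OFF"}:
        raise ValueError(f"Color switch must be ON or OFF, got: {color_switch!r}")
    return "png" if switch == "ON" else "tif"


if __name__ == "__main__":
    print(expected_input_extension("ON"))
```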

Troubleshooting

The following are a few common error messages reported when the plugin malfunctions:

 

S no. | Error message | Possible root cause
1 | Tesseract Base path not configured. | The environment variable for Tesseract is either not set or its path is configured incorrectly.
2 | Space found in the name of image: xyz.png. So it cannot be processed | The image name contains spaces. Remove the spaces from the image name and restart the batch from the page process module.
3 | No valid extensions are specified in resources | No valid extensions are specified for the input image files.
4 | Image Processing or XML updating failed for image: xyz | The input image file has an extension other than those specified in the ‘Tesseract Valid Extensions’ property.
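To check a batch for these conditions before restarting it, a pre-flight script along the following lines may help. It is a sketch only: the environment variable name `TESSERACT_BASE_PATH` is a placeholder (the actual variable name is not given here), the function `preflight_errors` is hypothetical, and the messages simply mirror the table above.

```python
import os
from typing import Iterable, List, Mapping, Optional


def preflight_errors(
    image_name: str,
    valid_extensions: Iterable[str],
    env: Optional[Mapping[str, str]] = None,
    base_path_var: str = "TESSERACT_BASE_PATH",  # placeholder name, an assumption
) -> List[str]:
    """Return the error messages that this image and setup would likely trigger."""
    env = os.environ if env is None else env
    errors: List[str] = []

    # 1: the Tesseract environment variable is missing or empty.
    if not env.get(base_path_var):
        errors.append("Tesseract Base path not configured.")

    # 2: spaces in the image name.
    if " " in image_name:
        errors.append(f"Space found in the name of image: {image_name}.")

    # 3 and 4: extension checks against the configured list.
    allowed = {ext.lower().lstrip(".") for ext in valid_extensions}
    if not allowed:
        errors.append("No valid extensions are specified in resources")
    else:
        extension = image_name.rsplit(".", 1)[-1].lower() if "." in image_name else ""
        if extension not in allowed:
            errors.append(
                f"Image Processing or XML updating failed for image: {image_name}"
            )
    return errors


if __name__ == "__main__":
    for message in preflight_errors("my scan.bmp", ["tif", "gif", "png"], env={}):
        print(message)
```

Passing `env` explicitly keeps the check testable without touching the real process environment.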


Walter Lee
