The Tesseract HOCR plugin can be added to the page processing module. It performs OCR on each input image and writes the result to an HOCR XML file.
The plugin reads the image files listed in the batch's batch.xml, generates an HOCR file for each of them, and updates batch.xml accordingly.
The Tesseract HOCR plugin has the following configurable properties:
|Configurable property||Type of value||Value options||Description|
|Tesseract Switch||List of values||ON, OFF||Turns this plugin ON or OFF. When the switch is OFF, the plugin does nothing.|
|Tesseract color switch||List of values||ON, OFF||Tesseract cannot read colored TIFFs. When the color switch is ON, PNG images are sent for OCRing instead. Turn the switch ON for batch classes that are expected to contain colored TIFF images.|
|Tesseract Language||String||NA||Selects the language used for OCRing. At present Tesseract supports only a single language per image file. E.g.: specify 'eng' for English, 'tur' for Turkish, etc.|
|Tesseract Version||String||NA||Defines the Tesseract version installed on the system. E.g.: specify 'tesseract_version_3' for Tesseract 3.0, 'tesseract_version_2' for Tesseract 2.0, etc.|
|Tesseract Valid Extensions||Multi-select||tif||Tesseract allows three image formats for OCRing. Use this property to configure which image formats are accepted as input.|
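To make the properties above more concrete, here is a minimal Python sketch of how the five settings might be held and sanity-checked together. The class name `TesseractHocrSettings`, its field names, and the default values are assumptions made for illustration; they are not the plugin's actual property keys.

```python
from dataclasses import dataclass, field

SWITCH_VALUES = ("ON", "OFF")


@dataclass
class TesseractHocrSettings:
    """Hypothetical container for the five properties in the table."""
    switch: str = "ON"                      # Tesseract Switch
    color_switch: str = "OFF"               # Tesseract color switch
    language: str = "eng"                   # Tesseract Language (one language only)
    version: str = "tesseract_version_3"    # Tesseract Version
    valid_extensions: list = field(default_factory=lambda: ["tif"])

    def problems(self):
        """Return a list of problems; an empty list means the settings look usable."""
        found = []
        if self.switch not in SWITCH_VALUES:
            found.append("switch must be ON or OFF")
        if self.color_switch not in SWITCH_VALUES:
            found.append("color_switch must be ON or OFF")
        # The table states that only a single language per image is supported,
        # so reject anything that looks like a combined or empty value.
        if not self.language or any(c in self.language for c in "+, "):
            found.append("language must be a single code such as 'eng'")
        if not self.valid_extensions:
            found.append("at least one valid extension is required")
        return found
```

Calling `TesseractHocrSettings().problems()` on the assumed defaults yields an empty list, while a value such as `language="eng+tur"` is reported as a problem.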
Steps of execution
- This plugin runs in the page process phase of the application, after all import processing on the batch is complete and the batch is ready for page processing.
- The plugin performs OCRing on all the input images.
- When all processing is done, it generates the HOCR output for each image in the form of an HOCR.xml file and writes the name of each HOCR file to the batch's batch.xml.
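The steps above can be sketched roughly as follows. This is not the plugin's implementation; it only illustrates how one Tesseract command per input image could be assembled, assuming the `tesseract` executable is on the PATH and its stock `hocr` config file is installed. The function names and the `work_dir` parameter are hypothetical.

```python
from pathlib import Path


def build_hocr_command(image_path, output_base, language="eng"):
    """Build the argument list for one Tesseract run that emits hOCR output."""
    # General form: tesseract <image> <output base> -l <language> hocr
    return ["tesseract", str(image_path), str(output_base), "-l", language, "hocr"]


def plan_page_processing(image_names, work_dir, language="eng"):
    """Return one Tesseract command per input image, in input order."""
    commands = []
    for name in image_names:
        image_path = Path(work_dir) / name
        # Tesseract appends its own extension to the output base,
        # so the image extension is dropped here.
        output_base = image_path.with_suffix("")
        commands.append(build_hocr_command(image_path, output_base, language))
    return commands
```

Each returned command could then be executed with `subprocess.run(command, check=True)`, after which the generated file name would be recorded in batch.xml.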
This plugin requires only an image as input: a PNG if the color switch is ON, or a TIFF if the color switch is OFF. Hence one of the 'Create OCR Input Plugin' or 'Create Display Image Plugin' plugins must be executed before this plugin.
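The input rule described here is small enough to express directly. The sketch below assumes each page has a common file stem with `.png` and `.tif` variants; the function name and that naming scheme are illustrative assumptions, not taken from the plugin.

```python
def pick_ocr_input(page_stem, color_switch):
    """Return the image file name to hand to OCR for one page."""
    if color_switch not in ("ON", "OFF"):
        raise ValueError("color_switch must be ON or OFF")
    # Color switch ON: use the PNG, because colored TIFFs cannot be read.
    # Color switch OFF: use the TIFF.
    extension = ".png" if color_switch == "ON" else ".tif"
    return page_stem + extension
```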
The following are a few common error messages raised when the plugin malfunctions:
|S. No.||Error message||Possible root cause|
|1||Tesseract Base path not configured.||The environment variable for Tesseract is not set, or its path is configured incorrectly.|
|2||Space found in the name of image: xyz.png. So it cannot be processed||The image file name contains spaces. Remove the spaces from the image name and restart the batch from the page process module.|
|3||No valid extensions are specified in resources||Valid extensions for input image files have not been specified.|
|4||Image Processing or XML updation failed for image: img.tif||The image file given as input has an extension other than those specified in the 'Tesseract Valid Extensions' property.|
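As a rough illustration of how the root causes in this table map to checks, the following sketch tests one image name against each condition and returns the matching message. The environment variable name `TESSERACT_PATH` is a placeholder assumption, since the actual variable name is not given here, and the order of the checks is also an assumption.

```python
import os


def preflight(image_name, valid_extensions, env=None, base_path_var="TESSERACT_PATH"):
    """Return the first matching error message from the table, or None if all checks pass."""
    env = os.environ if env is None else env
    if not env.get(base_path_var):  # hypothetical variable name
        return "Tesseract Base path not configured."
    if not valid_extensions:
        return "No valid extensions are specified in resources"
    if " " in image_name:
        return f"Space found in the name of image: {image_name}. So it cannot be processed"
    # Compare extensions case-insensitively and without the leading dot.
    extension = os.path.splitext(image_name)[1].lstrip(".").lower()
    allowed = [e.lstrip(".").lower() for e in valid_extensions]
    if extension not in allowed:
        return f"Image Processing or XML updation failed for image: {image_name}"
    return None
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment, for example `preflight("img.tif", ["tif"], env={"TESSERACT_PATH": "/opt/tesseract"})`.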