What’s New In Transact 4.5?
OCR | OCR Languages Selection from the UI
Ephesoft Transact supports three OCR engines: Tesseract, Recostar (for Windows), and Nuance (for Linux). The user can select any one of them depending on preference and system requirements.
Previously, to define OCR languages for Recostar or Nuance, the user had to locate the required backend folders on the server and edit the OCR input file manually. The Tesseract OCR language could be specified from the UI, but the language name had to be typed manually in the corresponding field.
In Ephesoft Transact 4.5, a new multi-select-suggestion widget has been added to the Plugin Configuration screen for all three OCR engines under the Page Process module. Using this widget, you can select the language(s) and have the OCR engine input file updated automatically from the UI instead of editing it manually.
The name of the new Plugin Configuration field for the Nuance and Tesseract OCR engines is “OCR Language”. The Recostar OCR engine, on the other hand, takes only the country name as the language input. To keep it compatible with the other definitions, the same field in the Recostar HOCR plugin is called “OCR Country/Language”.
TESSARACT_HOCR Plugin Configuration screen
NUANCE_HOCR Plugin Configuration screen
RECOSTAR_HOCR Plugin Configuration screen
When you select or type a language name, the widget helps by offering suggestions. To open the complete suggestion list, type the suggestion token, which is a semicolon (;), or click in the predictive-typing field when no language is selected. After the suggestion token, the widget automatically lists languages based on your input.
When you start typing the first letters of the required language name, the widget will suggest languages according to the letters already entered.
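The suggestion behavior described above can be pictured with a minimal sketch. This is not the widget's actual implementation: the language list, the function name `suggest`, and the rule that already selected languages are left out of the suggestions are all assumptions made for illustration.

```python
# Hypothetical language list; the real list depends on the OCR engine in use.
LANGUAGES = ["Danish", "Dutch", "English", "Finnish", "French", "German", "Spanish"]

# The suggestion token named in the text above.
SUGGESTION_TOKEN = ";"


def suggest(field_text, languages=LANGUAGES):
    """Return the languages that match the entry currently being typed."""
    # Everything after the last token is treated as the entry in progress.
    parts = field_text.split(SUGGESTION_TOKEN)
    # Assumption: languages already entered are not suggested again.
    selected = {p.strip().lower() for p in parts[:-1] if p.strip()}
    prefix = parts[-1].strip().lower()
    # An empty prefix matches every language, so the token alone opens the full list.
    return [
        lang
        for lang in languages
        if lang.lower().startswith(prefix) and lang.lower() not in selected
    ]
```

With this sketch, `suggest(";")` returns the complete list, `suggest("Fr")` returns `["French"]`, and `suggest("English;D")` returns `["Danish", "Dutch"]`.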
The multi-select-suggestion widget has several icons associated with it:
– Help icon provides suggestions (for example, it reminds you to use the suggestion token (;) to view the language suggestions list).
– Error icon indicates that the input you provided or selected is wrong (for example, the field is left empty or contains an invalid value).
Note: The error icon is also shown if you select a non-licensed language for Nuance (Arabic and the Asian languages Chinese_Simplified, Chinese_Traditional, Japanese, Korean) or Recostar (Chinese, Japanese, Korean, Thai).
– Warning icon warns and provides information (for example, it reminds you that the Tesseract Test-Data folder should contain Test Data for the selected Tesseract languages).
- If you do not specify the language in the HOCR plugin, English will be used by default.
- During OCR processing with the Recostar or Nuance OCR engine, the system checks whether all selected languages are licensed. If any of them is not, an empty HOCR is generated for all pages and an error is written to the log file.
- If you need to OCR documents in Asian languages using the Recostar OCR engine, you have to purchase an additional Ephesoft OCR language license for Asian languages (Chinese, Japanese, Korean, Thai). Similarly, when using Nuance, separate licenses have to be purchased for the Arabic language and for Asian languages (Chinese_Simplified, Chinese_Traditional, Japanese, Korean).
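The default-language and licensing rules in the notes above can be summarized in a short sketch. The restricted language names come from the text; the function names, the `purchased` parameter, and the way licenses are represented are hypothetical and do not reflect how Transact performs the check internally.

```python
# Languages that need a separate license, as listed in the notes above.
LICENSED_ONLY = {
    "RECOSTAR": {"Chinese", "Japanese", "Korean", "Thai"},
    "NUANCE": {"Arabic", "Chinese_Simplified", "Chinese_Traditional", "Japanese", "Korean"},
}

DEFAULT_LANGUAGE = "English"


def resolve_languages(selected):
    """Fall back to English when no language is specified in the plugin."""
    cleaned = [s.strip() for s in selected if s.strip()]
    return cleaned or [DEFAULT_LANGUAGE]


def unlicensed_languages(engine, selected, purchased=frozenset()):
    """Return selected languages that need a license not present in `purchased`.

    `purchased` is a hypothetical set of language names with a valid license.
    """
    # Engines without an entry (for example Tesseract) have no restricted set here.
    restricted = LICENSED_ONLY.get(engine.upper(), set())
    return [
        lang
        for lang in resolve_languages(selected)
        if lang in restricted and lang not in purchased
    ]
```

A non-empty result corresponds to the case in which the notes say an empty HOCR is generated and an error is logged. For example, `unlicensed_languages("Nuance", ["English", "Arabic"])` returns `["Arabic"]`.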
Information about the selected language(s) is now also included in the HOCR.xml file. The file contains the <LanguageCode> tag with the code of the OCR language(s) specified in the TESSARACT_HOCR, RECOSTAR_HOCR, or NUANCE_HOCR plugin.
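As a rough illustration, the language code could be read back from an HOCR file with a few lines of Python. Only the <LanguageCode> tag name comes from the text above; the surrounding XML structure, the sample value, and the function name are assumptions, and a real file that uses XML namespaces would need the namespace included in the tag name.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: only the <LanguageCode> tag name is taken from the release notes.
SAMPLE = """<HocrPages>
  <HocrPage>
    <PageID>PG0</PageID>
    <LanguageCode>eng</LanguageCode>
  </HocrPage>
</HocrPages>"""


def language_codes(xml_text):
    """Collect the text of every <LanguageCode> element in the document."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so the tag's exact position does not matter.
    return [
        el.text.strip()
        for el in root.iter("LanguageCode")
        if el.text and el.text.strip()
    ]
```

For the sample fragment, `language_codes(SAMPLE)` returns `["eng"]`.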