
What’s New In Transact 4.5?


OCR | Fraud Detection Using OCR Font Switch

 

In Ephesoft Transact v4.5.0.0, a new Font Recognition switch has been introduced to help detect potential fraud and tampering in processed documents. When the Font switch is turned ON in the RECOSTAR_HOCR or NUANCE_HOCR plugin, the HOCR file records the font style (Bold, Italics, and Underline) and font size. This allows the user to detect data that has been manually altered or added to a document. By default, the Font switch is set to OFF.

For example, suppose the original amount in a document field is “1000” in font size 11, and the value is manually changed to “41000” with the “4” written in font size 12. The HOCR file records the font size and style, so the mismatch helps the user identify that the document has been tampered with.
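The check described in this example can be sketched in a few lines of Python. This is a minimal illustration, not part of Transact: it assumes the per-character values and sizes have already been read from the HOCR file into a list of pairs, and the function name `flag_size_outliers` is hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def flag_size_outliers(chars: List[Tuple[str, float]]) -> List[Tuple[int, str, float]]:
    """Return (index, character, size) for every character whose font size
    differs from the most common size in the field."""
    if not chars:
        return []
    # The most frequent size is treated as the field's original font size.
    dominant, _ = Counter(size for _, size in chars).most_common(1)[0]
    return [(i, ch, size) for i, (ch, size) in enumerate(chars) if size != dominant]


# Hypothetical per-character data for the tampered value "41000".
field = [("4", 12), ("1", 11), ("0", 11), ("0", 11), ("0", 11)]
print(flag_size_outliers(field))  # [(0, '4', 12)]
```

A field whose characters all share one size yields an empty list, so only mixed-size fields are surfaced for review.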

Note: Tesseract does not provide any information on font detection. This feature is available only in the Recostar and Nuance OCR engines.

The following changes have been made to implement this feature:

  • The HOCR schema has been revamped to include font information from the data fetched by the Recostar and Nuance OCR engines.
  • ON/OFF switches have been added to the RECOSTAR_HOCR and NUANCE_HOCR plugins, which the user can configure to retrieve font information.
  • The following Web Services have been modified to include font information in the HOCR file:
    • ocrClassifyExtract
    • initiateOCRClassifyExtractService
    • OcrClassifyExtractSearchablePDF
    • executeMobileUpload
    • extractFieldFromHocr
    • extractKV
    • classifyImage
    • classifyBarcodeImage
    • classifyHocr
    • classifyMultiPageHocr
    • decryptBatchInstanceHocrXml
    • decryptLuceneClassificationHocrXml
    • decryptTestHocrXml
    • keywordClassification
    • ocrClassify
    • ExtractKVForDocumentType
    • createHOCRforBatchClass
    • tableExtractionHOCR
  • The following Web Service can be configured to obtain font information in the HOCR file:
    • createOCR (a new parameter fontSwitch with an ON/OFF setting has been added to the input .xml file)
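As a rough sketch of preparing the createOCR input, the Python snippet below sets a `fontSwitch` value in an XML string. Only the parameter name `fontSwitch` and its ON/OFF values come from the notes above. The root element name `Input`, the assumption that `fontSwitch` is a child element rather than an attribute, and the helper name `set_font_switch` are all hypothetical, so check the actual input file layout before relying on this.

```python
import xml.etree.ElementTree as ET


def set_font_switch(input_xml: str, enabled: bool) -> str:
    """Set the fontSwitch element to ON or OFF, creating it under the root
    element if it is missing (element placement is an assumption)."""
    root = ET.fromstring(input_xml)
    node = next((el for el in root.iter() if el.tag == "fontSwitch"), None)
    if node is None:
        node = ET.SubElement(root, "fontSwitch")
    node.text = "ON" if enabled else "OFF"
    return ET.tostring(root, encoding="unicode")


# "Input" is a placeholder root element, not the real createOCR schema.
print(set_font_switch("<Input></Input>", True))
```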

 

RECOSTAR_HOCR Plugin


 

To fetch the font information via the RECOSTAR_HOCR plugin:

  1. Navigate to Batch Class > Modules > Page Process and add the RECOSTAR_HOCR plugin.
  2. Open the RECOSTAR_HOCR plugin and turn ON the Recostar Font Switch.
  3. Click Apply to update the Batch Class configuration.
  4. Click Deploy to apply the changes in the workflow.
  5. Process a batch to verify the changes in the HOCR schema.

The newly generated HOCR file now includes the font size of each character in the span. A tag named “UnicodeCharacters” has been added to the HOCR file, which contains the value and size of each character. A tag named “Style” has also been added, which contains the style (Bold, Italics, or Underline) of the span. If style information is not fetched, its value is “None”.
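One way to verify the result after processing a batch is to scan the generated HOCR XML for the new tags. The sketch below is illustrative only: the tag names “Style” and “UnicodeCharacters” come from the description above, while the surrounding element names in the sample (`HocrPages`, `Span`, `Value`) and the helper names are assumptions, and the real file layout may differ.

```python
import xml.etree.ElementTree as ET
from typing import List


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{uri}Style' down to 'Style'."""
    return tag.rsplit("}", 1)[-1]


def has_font_info(hocr_xml: str) -> bool:
    """True if the document contains any Style or UnicodeCharacters tag."""
    root = ET.fromstring(hocr_xml)
    return any(local_name(el.tag) in ("Style", "UnicodeCharacters") for el in root.iter())


def detected_styles(hocr_xml: str) -> List[str]:
    """Return every Style value other than the placeholder 'None'."""
    root = ET.fromstring(hocr_xml)
    values = [(el.text or "").strip() for el in root.iter() if local_name(el.tag) == "Style"]
    return [v for v in values if v and v != "None"]


# Hypothetical sample; only the Style tag name is taken from the release notes.
SAMPLE = """<HocrPages>
  <Span><Value>41000</Value><Style>Bold</Style></Span>
  <Span><Value>Total</Value><Style>None</Style></Span>
</HocrPages>"""

print(has_font_info(SAMPLE), detected_styles(SAMPLE))
```

Matching on the local tag name keeps the check independent of any namespace the HOCR file might declare.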

 

To see the difference in the HOCR schema when the Font switch is turned OFF:

  1. Turn OFF the Recostar Font Switch and save your changes.
  2. Process a batch.

Font style and size information is not fetched when the switch is turned OFF.

Note: The Recostar OCR engine does not recognize combinations of font styles. For example, the style value is “None” if a character string is both bold and underlined.

 

 

NUANCE_HOCR Plugin


 

To fetch font information via the NUANCE_HOCR plugin:

  1. Navigate to Batch Class > Modules > Page Process and add the NUANCE_HOCR plugin.
  2. Open the NUANCE_HOCR plugin and turn ON the Nuance Font Switch.
  3. Click Apply to update the Batch Class configuration.
  4. Click Deploy to apply the changes in the workflow.
  5. Process a batch with the Font Switch ON.

The newly generated HOCR file now includes the font size of each character in the span. A tag named “UnicodeCharacters” has been added to the HOCR file, which contains the value and size of each character. A tag named “Style” has also been added, which contains the style (Bold, Italics, or Underline) of the span. If style information is not fetched, its value is “None”.


 

To see the difference in the HOCR schema when the Font switch is turned OFF:

  1. Turn OFF the Nuance Font Switch and save your changes.
  2. Process a batch.

Font style and size information is not fetched when the switch is turned OFF.

Note: The Nuance OCR engine recognizes combinations of font styles and returns comma-separated values when multiple styles are detected. However, it does not recognize the size of individual characters. All characters in a word are always recognized as having the same size, even though some letters might be capitalized.
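Because Nuance may report several styles in one value, downstream code that reads the “Style” tag needs to split it. The Python sketch below shows one way to do that. The exact separator formatting (for instance, whether a space follows the comma) is an assumption, and `parse_styles` is a hypothetical helper name.

```python
from typing import FrozenSet


def parse_styles(style_value: str) -> FrozenSet[str]:
    """Split a comma-separated Style value into a set of style names.
    The placeholder 'None' and empty values map to an empty set."""
    parts = (part.strip() for part in style_value.split(","))
    return frozenset(part for part in parts if part and part != "None")


print(sorted(parse_styles("Bold,Underline")))  # ['Bold', 'Underline']
print(sorted(parse_styles("None")))            # []
```

Returning a set makes comparisons order-independent, so “Bold,Underline” and “Underline,Bold” are treated as the same combination.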