
Machine Learning SDK APIs and Web Services

In Ephesoft Transact v.4.5.0.0, new APIs and Web Services have been created for machine learning and machine learning based extraction. The SDK contains seven APIs: three for machine learning and four for machine learning based extraction. There are also four new Web Services: two for machine learning and two for extraction. These APIs consume the provided data, process it, and return learned or extracted information. They can be used with or without Ephesoft Transact specific parameters.

MACHINE LEARNING SDK

APIs for Machine Learning

Learn document field with Ephesoft Transact specific parameters

API Name generateModel()
Structure public void generateModel(final String documentTypeName, final String docFieldName, final FieldData fieldData, final String hocrFilePath, final CustomParameters customParameters, final String learningFilePath, final String dictionaryPath) throws MLException;
Description Learns document field on the basis of parameters specific to Ephesoft Transact.
Parameters documentTypeName – Name of the document type in which the index field needs to be extracted. The same document type name must be used when invoking the SDK extraction API.

docFieldName – Name of the index field that needs to be extracted. The same field name must be used when invoking the SDK extraction API.

fieldData – Model to specify the coordinates and the type of the value to be learned. Contains the following attributes:

  • docFieldType – Value type of the index field. For non-predefined type, customType attribute of ValueTypeDto is mandatory.
  • coordinate – Coordinates of docFieldValue on the page. These coordinates must be fetched from the HOCR xml file (see next parameter).

hocrFilePath – Absolute path of the HOCR xml file or folder containing one or more HOCR xml files. The format of the user provided HOCR xml file should be the same as that of the Ephesoft HOCR xml.

customParameters – All custom regex that should be identified as potential values. Must contain entries if non-predefined types are used for docFieldType. The custom object contains the following attributes:

  • customBlockDatas: List of custom block data used for Custom_block type.
    • valueTypeDto: List of types in sequential order to be identified as block.
      • customType: Custom name of the types other than predefined type. Empty if predefined type is selected.
      • types: Predefined type mapper. Type.CUSTOM and Type.CUSTOM_BLOCK are used for custom and custom block value type.
    • typeName: Name of the custom block type.
  • customRegexDatas: List of custom regex data used for the custom type.
    • regex: Regex of the custom type.
    • typeName: Name of the custom type.

learningFilePath – Path to the folder where the learning file gets generated.

dictionaryPath – Path to the folder where machine learning dictionaries are located. Sample path: E:\Ephesoft\SharedFolders\BC5\machine-learning-dictionaries.

Returns Nothing (the method is declared void).
Throws Custom exception of MLException type.
Comments All input parameters are mandatory, except dictionaryPath.

Machine learning dictionaries are shipped along with the jar files. Dictionaries can be customized as required.
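
The coordinate attribute of fieldData has to be read from the HOCR xml file. The following is a minimal sketch of such a lookup, not part of the SDK. It assumes that each recognized word is stored in a Span element with a Value child and a Coordinates child holding x0, y0, x1 and y1; that layout is an assumption modeled on the Coordinates element used in the web service samples, so verify it against an actual Ephesoft HOCR xml file. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Illustrative helper; the Span/Value/Coordinates layout is an assumption. */
final class HocrCoordinateLookup {

    /** Returns {x0, y0, x1, y1} of the first Span whose Value equals text, or null if none matches. */
    static int[] findCoordinates(String hocrXml, String text) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(hocrXml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse HOCR xml", e);
        }
        NodeList spans = doc.getElementsByTagName("Span");
        for (int i = 0; i < spans.getLength(); i++) {
            Element span = (Element) spans.item(i);
            NodeList boxes = span.getElementsByTagName("Coordinates");
            // Skip spans without a bounding box or with a different value.
            if (boxes.getLength() > 0 && text.equals(childText(span, "Value"))) {
                Element box = (Element) boxes.item(0);
                return new int[] {
                    Integer.parseInt(childText(box, "x0")),
                    Integer.parseInt(childText(box, "y0")),
                    Integer.parseInt(childText(box, "x1")),
                    Integer.parseInt(childText(box, "y1"))
                };
            }
        }
        return null;
    }

    /** Trimmed text of the first descendant with the given tag, or null if absent. */
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }
}
```

The four integers returned here are the values that would be copied into the coordinate attribute of fieldData before calling generateModel().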

Learn document field independent of Ephesoft Transact specific parameters

API Name generateModel()
Structure public void generateModel(final FieldData fieldData, final String hocrFilePath, final CustomParameters customParameters, final String learningFilePath, final String dictionaryPath) throws MLException;
Description Learns document field independent of parameters specific to Ephesoft Transact. Since no Transact parameters are applied, by default, the document type name in the JSON file will be ‘Doc1’ and the index field name will be ‘Field1’.
Parameters fieldData – Model to specify the coordinates and the type of the value to be learned. Contains the following attributes:

  • docFieldType – Value type of the index field. For non-predefined type, customType attribute of ValueTypeDto is mandatory.
  • coordinate – Coordinates of docFieldValue on the page. These coordinates must be fetched from the HOCR xml file (see next parameter).

hocrFilePath – Absolute path of the HOCR xml file or folder containing one or more HOCR xml files. The format of the user provided HOCR xml file should be the same as that of the Ephesoft HOCR xml.

customParameters – All custom regex that should be identified as potential values. Must contain entries if non-predefined types are used for docFieldType. The custom object contains the following attributes:

  • customBlockDatas: List of custom block data used for Custom_block type.
    • valueTypeDto: List of types in sequential order to be identified as block.
      • customType: Custom name of the types other than predefined type. Empty if predefined type is selected.
      • types: Predefined type mapper. Type.CUSTOM and Type.CUSTOM_BLOCK are used for custom and custom block value type.
    • typeName: Name of the custom block type.
  • customRegexDatas: List of custom regex data used for the custom type.
    • regex: Regex of the custom type.
    • typeName: Name of the custom type.

learningFilePath – Path to the folder where the learning file gets generated.

dictionaryPath – Path to the folder where machine learning dictionaries are located. Sample path: E:\Ephesoft\SharedFolders\BC5\machine-learning-dictionaries.

Returns Nothing (the method is declared void).
Throws Custom exception of MLException type.
Comments All input parameters are mandatory, except dictionaryPath.

Machine learning dictionaries are shipped along with the jar files. Dictionaries can be customized as required.
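
To picture how customRegexDatas and customBlockDatas relate, the sketch below treats a custom block as an ordered list of type names, each backed by a regex, and combines them into one pattern. This is only an illustration of the idea, not how the SDK matches values internally. The whitespace separator between types and the class name are assumptions; the pattern used for the predefined SSN_NO type in the usage note is also assumed, since the article does not publish the predefined regexes.

```java
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical illustration of a custom block as an ordered sequence of typed regexes. */
final class CustomBlockSketch {

    /**
     * Builds one pattern that matches the given types in sequential order,
     * separated by whitespace (the separator is an assumption of this sketch).
     */
    static Pattern blockPattern(List<String> typeNames, Map<String, String> regexByType) {
        StringBuilder combined = new StringBuilder();
        for (String typeName : typeNames) {
            String regex = regexByType.get(typeName);
            if (regex == null) {
                throw new IllegalArgumentException("No regex registered for type " + typeName);
            }
            if (combined.length() > 0) {
                combined.append("\\s+");
            }
            // A non-capturing group keeps alternations inside one type from leaking out.
            combined.append("(?:").append(regex).append(")");
        }
        return Pattern.compile(combined.toString());
    }
}
```

For example, with ssn_type mapped to the regex SSN and SSN_NO mapped to an assumed pattern such as \d{3}-\d{2}-\d{4}, the block [ssn_type, SSN_NO] matches the text "SSN 123-45-6789" but not "123-45-6789 SSN", because the order of the types matters.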

Learn document field using Fuzzy Matching flag

API Name generateModel()
Structure public void generateModel(final String documentTypeName, final String docFieldName, final FieldData fieldData, final String hocrFilePath, final CustomParameters customParameters, final String learningFilePath, final String dictionaryPath, final boolean fuzzyMatch) throws MLException;
Description Learns document field on the basis of parameters specific to Ephesoft Transact, with an additional flag that turns fuzzy matching on or off.
Parameters documentTypeName – Name of the document type in which the index field needs to be extracted. The same document type name must be used when invoking the SDK extraction API.

docFieldName – Name of the index field that needs to be extracted. The same field name must be used when invoking the SDK extraction API.

fieldData – Model to specify the coordinates and the type of the value to be learned. Contains the following attributes:

  • docFieldType – Value type of the index field. For non-predefined type, customType attribute of ValueTypeDto is mandatory.
  • coordinate – Coordinates of docFieldValue on the page. These coordinates must be fetched from the HOCR xml file (see next parameter).

hocrFilePath – Absolute path of the HOCR xml file or folder containing one or more HOCR xml files. The format of the user provided HOCR xml file should be the same as that of the Ephesoft HOCR xml.

customParameters – All custom regex that should be identified as potential values. Must contain entries if non-predefined types are used for docFieldType. The custom object contains the following attributes:

  • customBlockDatas: List of custom block data used for Custom_block type.
    • valueTypeDto: List of types in sequential order to be identified as block.
      • customType: Custom name of the types other than predefined type. Empty if predefined type is selected.
      • types: Predefined type mapper. Type.CUSTOM and Type.CUSTOM_BLOCK are used for custom and custom block value type.
    • typeName: Name of the custom block type.
  • customRegexDatas: List of custom regex data used for the custom type.
    • regex: Regex of the custom type.
    • typeName: Name of the custom type.

learningFilePath – Path to the folder where the learning file gets generated.

dictionaryPath – Path to the folder where machine learning dictionaries are located. Sample path: E:\Ephesoft\SharedFolders\BC5\machine-learning-dictionaries.

fuzzyMatch – Boolean flag. True when the client needs fuzzy matching, otherwise false.

Returns Nothing (the method is declared void).
Throws Custom exception of MLException type.
Comments All input parameters are mandatory, except dictionaryPath.

Machine learning dictionaries are shipped along with the jar files. Dictionaries can be customized as required.

APIs for Machine Learning Based Extraction

Extract field value using parameters specific to Ephesoft Transact

API Name extractField()
Structure public List<FieldType> extractField(final String folderPath, final String hocrFilesPath, final String docTypeName, final List<String> docFieldName, final String dictionaryPath) throws MLException;
Description Extracts field value on the basis of parameters specific to Ephesoft Transact.
Parameters folderPath – Absolute path of the folder containing machine learning files.

hocrFilesPath – Absolute path of the HOCR xml file or folder containing one or more HOCR xml files. The format of the user provided HOCR xml file should be the same as that of the Ephesoft HOCR xml.

docTypeName – Name of the document type in which the index fields need to be extracted. The name of the document type should be exactly the same as the name given during machine learning.

docFieldName – List of names for the index fields which need to be extracted. The names of the index fields should be exactly the same as the names given during machine learning.

dictionaryPath – Path to the folder where machine learning dictionaries are located. Sample path: E:\Ephesoft\SharedFolders\BC5\machine-learning-dictionaries.

Returns List of the FieldType objects.
Throws Custom exception of the MLException type.
Comments All input parameters are mandatory, except dictionaryPath.

Extract field value independent of Ephesoft Transact specific parameters

API Name extractField()
Structure public List<FieldType> extractField(final String folderPath, final String hocrFilesPath, final String dictionaryPath) throws MLException;
Description Extracts field value independent of parameters specific to Ephesoft Transact.
Parameters folderPath – Absolute path of the folder containing machine learning files.

hocrFilesPath – Absolute path of the HOCR xml file or folder containing one or more HOCR xml files. The format of the user provided HOCR xml file should be the same as that of the Ephesoft HOCR xml.

dictionaryPath – Path to the folder where machine learning dictionaries are located. Sample path: E:\Ephesoft\SharedFolders\BC5\machine-learning-dictionaries.

Since no Transact parameters are applied, the document type name in the JSON file defaults to ‘Doc1’ and the index field name defaults to ‘Field1’.

Returns List of the FieldType objects.
Throws Custom exception of the MLException type.
Comments All input parameters are mandatory, except dictionaryPath.

Machine learning dictionaries are shipped along with the jar files. Dictionaries can be customized as required.

Extract field value using Ephesoft Transact specific parameter and customized cut-off threshold

API Name extractField()
Structure public List<FieldType> extractField(final String folderPath, final String hocrFilesPath, final String docTypeName, final List<String> docFieldName, final float threshold, final String dictionaryPath) throws MLException;
Description Extracts field value using Ephesoft Transact specific parameter and customized cut-off threshold.

The cut-off feature is used in the machine learning based extraction algorithm to stop extraction once the system finds a value that is equal to or greater than the defined threshold.

The cut-off threshold is defined at the application level and lies within the range from 0 to 100. It is included in the processing logic and applied by default to all extraction APIs. For the extraction APIs without this parameter, the default value is 95.

This API takes threshold as an additional parameter, allowing the user to select a preferred threshold value.

Parameters folderPath – Absolute path of the folder containing machine learning files.

hocrFilesPath – Absolute path of the HOCR xml file or folder containing one or more HOCR xml files. The format of the user provided HOCR xml file should be the same as that of the Ephesoft HOCR xml.

docTypeName – Name of the document type in which the index fields need to be extracted. The name of the document type should be exactly the same as the name given during machine learning.

docFieldName – List of names for the index fields which need to be extracted. The names of the index fields should be exactly the same as the names given during machine learning.

threshold – Threshold value at which machine learning extraction is cut off. The valid range is 0 to 100. If the value is passed incorrectly, the default value of 90 is used.

dictionaryPath – Path to the folder where machine learning dictionaries are located. Sample path: E:\Ephesoft\SharedFolders\BC5\machine-learning-dictionaries.

Returns List of the FieldType objects.
Throws Custom exception of the MLException type.
Comments All input parameters are mandatory, except threshold and dictionaryPath.

Machine learning dictionaries are shipped along with the jar files. Dictionaries can be customized as required.
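
The cut-off behaviour described for this API can be pictured with the following minimal sketch. It is not the SDK's internal code: it assumes that each candidate value carries a confidence score on the same 0 to 100 scale as the threshold, and the Candidate class and method names are hypothetical. The fallback of 90 for an incorrectly passed threshold is taken from the parameter description above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a cut-off threshold; not the SDK's internal code. */
final class ThresholdCutoffSketch {

    /** Value the article gives for a threshold that is passed incorrectly. */
    static final float FALLBACK_THRESHOLD = 90f;

    /** Assumed shape of a candidate: extracted text plus a 0-100 confidence score. */
    static final class Candidate {
        final String value;
        final float confidence;

        Candidate(String value, float confidence) {
            this.value = value;
            this.confidence = confidence;
        }
    }

    /** Keeps a threshold inside 0..100; anything else (including NaN) falls back to 90. */
    static float effectiveThreshold(float requested) {
        return (requested >= 0f && requested <= 100f) ? requested : FALLBACK_THRESHOLD;
    }

    /** Examines candidates in order and stops at the first one that reaches the threshold. */
    static List<Candidate> scanUntilCutoff(List<Candidate> candidates, float requestedThreshold) {
        float threshold = effectiveThreshold(requestedThreshold);
        List<Candidate> examined = new ArrayList<>();
        for (Candidate candidate : candidates) {
            examined.add(candidate);
            if (candidate.confidence >= threshold) {
                break; // cut-off reached: later candidates are never examined
            }
        }
        return examined;
    }
}
```

A lower threshold stops the scan earlier, so fewer candidates are examined; a threshold of 100 stops only on a candidate with a score of exactly 100.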

Extract field value using Fuzzy Matching Flag

API Name extractField()
Structure public List<FieldType> extractField(final String folderPath, final String hocrFilesPath, final String docTypeName, final List<String> docFieldName, final float threshold, final String dictionaryPath, final boolean fuzzyMatch) throws MLException;
Description Extracts field value on the basis of parameters specific to Ephesoft Transact, a customized cut-off threshold, and a fuzzy matching flag.
Parameters folderPath – Absolute path of the folder containing machine learning files.

hocrFilesPath – Absolute path of the HOCR xml file or folder containing one or more HOCR xml files. The format of the user provided HOCR xml file should be the same as that of the Ephesoft HOCR xml.

docTypeName – Name of the document type, in which index field needs to be extracted. The name of the document type should be exactly the same as the name given during machine learning.

docFieldName – List of names for the index fields which need to be extracted. The names of the index fields should be exactly the same as the names given during machine learning.

threshold – Threshold value at which machine learning extraction is cut off. The valid range is 0 to 100. If the value is passed incorrectly, the default value of 90 is used.

dictionaryPath – Path to the folder where machine learning dictionaries are located. Sample path: E:\Ephesoft\SharedFolders\BC5\machine-learning-dictionaries.

fuzzyMatch – Boolean flag. True when the client needs fuzzy matching, otherwise false.

Returns List of the FieldType objects.
Throws Custom exception of the MLException type.
Comments All input parameters are mandatory, except dictionaryPath.

MACHINE LEARNING WEB SERVICES

Web Services for Machine Learning

generateBatchClassLearningModel

Generates a learning JSON file for the Ephesoft Batch Class.

Web Service URL

POST http://{serverName}:{port}/dcma/rest/generateBatchClassLearningModel

Input Parameters

Parameter Description Required
HOCR file Absolute path of the HOCR xml file. The format of the user provided HOCR xml file should be the same as that of the Ephesoft HOCR xml. Required
XML file The XML file will include the following parameters:

  • BatchClassIdentifier: Identifier of the batch class.
  • DocumentName: Name of the document type.
  • IndexFieldName: Name of the index field for which learning is to be done.
  • IndexFieldValueType: Type of field value. Must be one of the predefined types.
  • IndexFieldValueTypeName: Name of the value type. Used when the value type is Custom or Custom_Block.
  • Coordinates: Coordinates of the field value mapped to the HOCR xml file. Field value is computed on the basis of the coordinates and HOCR page.
  • CustomParameters: Custom object that stores regex type, regex and regex type name. It contains the following attributes:
    • customBlockDatas: List of custom block data used for Custom_block type.
      • valueTypeDto: List of types in sequential order to be identified as block.
        • customType: Custom name of the types other than predefined type. Empty if predefined type is selected.
        • types: Predefined type mapper. Type.CUSTOM and Type.CUSTOM_BLOCK are used for custom and custom block value type.
      • typeName: Name of the custom block type.
    • customRegexDatas: List of custom regex data used for the custom type.
      • regex: Regex of the custom type.
      • typeName: Name of the custom type.
  • DictionaryPath: Path to the folder where machine learning dictionaries are located. Sample path: E:\Ephesoft\SharedFolders\BC5\machine-learning-dictionaries.
Required

Headers

Header Name Description Required
Authorization The access token that will be used to authorize the request. Required
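
The URL pattern and the Authorization header can be put together as in the sketch below, which uses the standard java.net.http API (Java 11 or later). It is only an outline: the article does not state the exact format of the Authorization value or how the HOCR and XML files are attached to the request, so the token string and the request body are left to the caller. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper for addressing the Transact REST web services. */
final class TransactRequestSketch {

    /** Builds http://{serverName}:{port}/dcma/rest/{serviceName}. */
    static URI endpoint(String serverName, int port, String serviceName) {
        return URI.create("http://" + serverName + ":" + port + "/dcma/rest/" + serviceName);
    }

    /**
     * Creates a POST request carrying the access token in the Authorization header.
     * The body publisher is supplied by the caller because the upload format
     * is not described in this article.
     */
    static HttpRequest post(URI endpoint, String accessToken, HttpRequest.BodyPublisher body) {
        return HttpRequest.newBuilder(endpoint)
                .header("Authorization", accessToken)
                .POST(body)
                .build();
    }
}
```

The same helper applies to the other web services in this article by changing the serviceName argument, for example to machineExtraction.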

Returns

A successful call to the web service creates a JSON file in the “machine-learning-extraction” folder of the batch class for the specified document type and index field.

Sample XML file

The following example illustrates learning of custom_block type:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<MachineLearning targetNamespace="http://www.ephesoft.com/batch/customParams">
<BatchClassIdentifier>BC5</BatchClassIdentifier>
<DocumentName>Doc1</DocumentName>
<IndexFieldName>Field1</IndexFieldName>
<IndexFieldValueType>CUSTOM_BLOCK</IndexFieldValueType>
<IndexFieldValueTypeName>composite_type</IndexFieldValueTypeName>
<Coordinates>
<x0>525</x0>
<y0>242</y0>
<x1>1767</x1>
<y1>270</y1>
</Coordinates>
<CustomParameters>
<customBlockDatas>
<customBlockData>
<typeList>
<valueTypeDto>
<customType>ssn_type</customType>
<types>CUSTOM</types>
</valueTypeDto>
<valueTypeDto>
<customType></customType>
<types>SSN_NO</types>
</valueTypeDto>
</typeList>
<typeName>composite_type</typeName>
</customBlockData>
</customBlockDatas>
<customRegexDatas>
<customRegexData>
<regex>SSN</regex>
<typeName>ssn_type</typeName>
</customRegexData>
</customRegexDatas>
</CustomParameters>
<DictionaryPath>E:\Ephesoft\SharedFolders\BC5\machine-learning-dictionaries</DictionaryPath>
</MachineLearning>

Note: When adding CustomRegexData and CustomBlockData tags, ensure that they contain no empty values. Add these tags only with valid values.

generateMachineLearningModel

Generates a JSON file containing machine learning model.

Web Service URL

POST http://{serverName}:{port}/dcma/rest/generateMachineLearningModel

Input Parameters

Parameter Description Required
HOCR file Absolute path of the HOCR xml file. The format of the user provided HOCR xml file should be the same as that of the Ephesoft HOCR xml. Required
XML file The XML file will include the following parameters:

  • IndexFieldValueType: Type of field value. Must be one of the predefined types.
  • IndexFieldValueTypeName: Name of the value type. Used when the value type is Custom or Custom_Block.
  • Coordinates: Coordinates of the field value mapped to the HOCR xml file. Field value is computed on the basis of the coordinates and HOCR page.
  • CustomParameters: Custom object that stores regex type, regex and regex type name. It contains the following attributes:
    • customBlockDatas: List of custom block data used for Custom_block type.
      • valueTypeDto: List of types in sequential order to be identified as block.
        • customType: Custom name of the types other than predefined type. Empty if predefined type is selected.
        • types: Predefined type mapper. Type.CUSTOM and Type.CUSTOM_BLOCK are used for custom and custom block value type.
      • typeName: Name of the custom block type.
    • customRegexDatas: List of custom regex data used for the custom type.
      • regex: Regex of the custom type.
      • typeName: Name of the custom type.
  • DictionaryPath: Path to the folder where machine learning dictionaries are located. Sample path: E:\Ephesoft\SharedFolders\BC5\machine-learning-dictionaries.
Required

Headers

Header Name Description Required
Authorization The access token that will be used to authorize the request. Required

Returns

A successful call to the web service downloads a zip folder containing the machine learned model (JSON) file.

Sample XML file

The following example illustrates learning of custom_block type:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<MachineLearningSDK targetNamespace="http://www.ephesoft.com/batch/customParams">
<IndexFieldValueType>CUSTOM_BLOCK</IndexFieldValueType>
<IndexFieldValueTypeName>composite_type</IndexFieldValueTypeName>
<Coordinates>
<x0>525</x0>
<y0>242</y0>
<x1>1767</x1>
<y1>270</y1>
</Coordinates>
<CustomParameters>
<customBlockDatas>
<customBlockData>
<typeList>
<valueTypeDto>
<customType>ssn_type</customType>
<types>CUSTOM</types>
</valueTypeDto>
<valueTypeDto>
<customType></customType>
<types>SSN_NO</types>
</valueTypeDto>
</typeList>
<typeName>composite_type</typeName>
</customBlockData>
</customBlockDatas>
<customRegexDatas>
<customRegexData>
<regex>SSN</regex>
<typeName>ssn_type</typeName>
</customRegexData>
</customRegexDatas>
</CustomParameters>
<DictionaryPath>E:\Ephesoft\SharedFolders\BC5\machine-learning-dictionaries</DictionaryPath>
</MachineLearningSDK>

Note: When adding CustomRegexData and CustomBlockData tags, ensure that they contain no empty values. Add these tags only with valid values.

For example, if you are learning a field with the predefined type “STATE”, the CustomParameters tag should be removed. In that case, input.xml looks as follows:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<MachineLearningSDK targetNamespace="http://www.ephesoft.com/batch/customParams">
<IndexFieldValueType>STATE</IndexFieldValueType>
<IndexFieldValueTypeName>STATE</IndexFieldValueTypeName>
<Coordinates>
<x0>525</x0>
<y0>242</y0>
<x1>1767</x1>
<y1>270</y1>
</Coordinates>
</MachineLearningSDK>
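
The two rules stated in the notes above (no empty values in the custom regex and custom block data, and no CustomParameters tag for a predefined type) can be checked before the XML is sent. The following is a minimal client-side sketch under that reading of the notes; it is not a validation performed by Transact itself, and the class and method names are hypothetical. It inspects only the regex and typeName elements, because customType is documented as empty when a predefined type is selected.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical pre-flight check for the learning request XML. */
final class LearningXmlCheck {

    /** Returns a list of human-readable problems; an empty list means none were found. */
    static List<String> problems(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse request xml", e);
        }
        List<String> problems = new ArrayList<>();

        NodeList typeNodes = doc.getElementsByTagName("IndexFieldValueType");
        String type = typeNodes.getLength() == 0 ? "" : typeNodes.item(0).getTextContent().trim();
        boolean customType = type.equals("CUSTOM") || type.equals("CUSTOM_BLOCK");

        // Rule 1: predefined types should not carry a CustomParameters tag.
        if (!customType && doc.getElementsByTagName("CustomParameters").getLength() > 0) {
            problems.add("CustomParameters should be removed for predefined type " + type);
        }

        // Rule 2: regex and typeName elements must not be empty.
        for (String tag : new String[] {"regex", "typeName"}) {
            NodeList nodes = doc.getElementsByTagName(tag);
            for (int i = 0; i < nodes.getLength(); i++) {
                if (nodes.item(i).getTextContent().trim().isEmpty()) {
                    problems.add("Empty <" + tag + "> value");
                }
            }
        }
        return problems;
    }
}
```

Run against the “STATE” sample above, this check reports no problems; adding a customRegexData entry with an empty regex to a CUSTOM_BLOCK request would produce one.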

Web Services for Machine Learning Based Extraction

machineExtractionForBatchClass

Extracts field value using the training JSON file present in the Ephesoft batch class folder.

Web Service URL

POST http://{serverName}:{port}/dcma/rest/machineExtractionForBatchClass

Input Parameters

Parameter Description Required
HOCR file Absolute path of the HOCR xml file. The format of the user provided HOCR xml file should be the same as that of the Ephesoft HOCR xml. Required
XML file with listed parameters
  • BatchClassIdentifier: Identifier of the batch class.
  • DocumentName: Name of the document type.
  • IndexFieldName: Name of the index field.
Required

Note: The Web Service will use machine learning dictionary files present in the Batch Class folder of the Batch Class specified in the BatchClassIdentifier parameter.

Headers

Header Name Description Required
Authorization The access token that will be used to authorize the request. Required

Returns

A successful call to the web service extracts the field values.

Sample XML file

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<MachineExtraction>
<BatchClassIdentifier>BC5</BatchClassIdentifier>
<DocumentName>Doc1</DocumentName>
<IndexFieldName>Field1</IndexFieldName>
</MachineExtraction>
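
A request file like the sample above can also be generated in code rather than written by hand. The sketch below builds the MachineExtraction document with the standard Java DOM and Transformer APIs, which take care of escaping special characters in the values. The element names come from the sample; the class and method names are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical builder for the machineExtractionForBatchClass request XML. */
final class MachineExtractionXml {

    /** Serializes a MachineExtraction document with the three documented parameters. */
    static String build(String batchClassIdentifier, String documentName, String indexFieldName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("MachineExtraction");
            doc.appendChild(root);
            appendChild(doc, root, "BatchClassIdentifier", batchClassIdentifier);
            appendChild(doc, root, "DocumentName", documentName);
            appendChild(doc, root, "IndexFieldName", indexFieldName);

            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException("Could not build request xml", e);
        }
    }

    /** Adds a child element whose text content is escaped by the serializer. */
    private static void appendChild(Document doc, Element parent, String name, String text) {
        Element child = doc.createElement(name);
        child.setTextContent(text);
        parent.appendChild(child);
    }
}
```

Calling MachineExtractionXml.build("BC5", "Doc1", "Field1") produces a document with the same three elements as the sample file.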

machineExtraction

Extracts field value using the zip file containing JSON files.

Web Service URL

POST http://{serverName}:{port}/dcma/rest/machineExtraction

Input Parameters

Parameter Description Required
HOCR file Absolute path of the HOCR xml file. The format of the user provided HOCR xml file should be the same as that of the Ephesoft HOCR xml. Required
ZIP File The zip file that contains one or more training JSON files. Required
XML File Xml file with the following parameters:

  • DocumentName: Name of the document type.
  • IndexFieldNames: List of names for the index fields which need to be extracted.
  • DictionaryPath: Path to the folder where machine learning dictionaries are located. Sample path: E:\Ephesoft\SharedFolders\BC5\machine-learning-dictionaries.
Required
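
The ZIP File parameter bundles one or more training JSON files. A minimal sketch of creating such an archive in memory with the standard java.util.zip classes follows. The entry names and JSON content used with it are placeholders, since the article does not specify a naming scheme for the training files, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper that packs training JSON files into a zip archive. */
final class TrainingZipSketch {

    /** Writes each map entry (file name to JSON text) as one zip entry. */
    static byte[] zip(Map<String, String> jsonByFileName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, String> file : jsonByFileName.entrySet()) {
                zip.putNextEntry(new ZipEntry(file.getKey()));
                zip.write(file.getValue().getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Lists entry names in archive order, useful for checking what will be uploaded. */
    static List<String> entryNames(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }
}
```

Passing a LinkedHashMap keeps the entries in insertion order, which makes the resulting archive easier to inspect.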

Headers

Header Name Description Required
Authorization The access token that will be used to authorize the request. Required

Returns

A successful call to the web service extracts the field values.

Sample XML file

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
 <MachineExtractionSDK>
 <DocumentName>Doc1</DocumentName>
 <IndexFieldNames>
 <IndexFieldName>f1</IndexFieldName>
 <IndexFieldName>f2</IndexFieldName>
 <IndexFieldName>f3</IndexFieldName>
 </IndexFieldNames>
 <DictionaryPath>E:\Ephesoft\SharedFolders\BC5\machine-learning-dictionaries</DictionaryPath>
 </MachineExtractionSDK>


Vincent Francis