Overview

This plugin forms documents from single pages. It reads all the pages in the document of type “Unknown” and creates new documents on the basis of page level fields. The Create Document plugin reviews the page level field results and decides which page is a first page and what the document type is, based on the page_level_index fields.

Ephesoft supports five types of classification: Barcode, Search, Image, Automatic and One Document classification. Only one type of classification can be set for a batch at a time. If the classification type is set to Automatic, the page level fields produced by all three of the Barcode, Image and Search (Lucene) classifications are used, and the top result among the three is used to classify pages into documents. The default configuration in the property file lists them in the order Barcode, then Image, then Lucene search classification.

  • Barcode classification: In Barcode classification, Ephesoft forms documents on the basis of the barcode present in the images being processed. The barcode value, extracted in the Page Process module, is used to assign the page type. A page for which a value is extracted is set as the starting page of a new document, whereas pages for which no barcode is extracted are simply added to the existing document.
  • Search classification: In Search classification, Ephesoft forms documents on the basis of text found on the images. When images are learned for document types, indexes of the input files’ text are created and stored. These indexes are matched against the text of the images being processed to classify them into document types. The results of text matching between the learned images and the images being processed are used for creating documents.
  • Image classification: In Image classification, Ephesoft forms documents on the basis of the image samples provided for learning. Image classification is done by superimposing the thumbnails of the learned images and of the images being processed to generate a confidence that the two images match. Using this confidence, the images are classified into documents.
  • Automatic classification: In Automatic classification, Ephesoft forms documents on the basis of the top results from the page level fields populated by the Search (Lucene), Barcode and Image classification plugins of the Page Process module. The default configuration in the property file lists them in the order Barcode, then Image, then Search classification.
  • One Document classification: In One Document classification, Ephesoft forms only one document and all pages are classified into it. If multiple documents are configured in the Batch Class, all input images are classified into the first document only.
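The selection step of Automatic classification can be pictured with the short Python sketch below. It is only an illustration, not Ephesoft code: the function name, the data layout and the use of the configured order (Barcode, Image, Search) as a tie-breaker are all assumptions made for this example.

```python
from typing import Dict, Optional, Tuple

# Assumed tie-break order, taken from the default order named in the text.
PRIORITY = ["Barcode", "Image", "Search"]


def pick_top_result(
    results: Dict[str, Tuple[str, float]]
) -> Optional[Tuple[str, str, float]]:
    """Return (classifier, page_type, confidence) for the best result.

    `results` maps a classifier name to its (page_type, confidence) pair.
    This is a hypothetical structure used only for illustration.
    """
    best = None
    for name in PRIORITY:
        if name not in results:
            continue
        page_type, confidence = results[name]
        # Strict '>' keeps the earlier classifier in PRIORITY on a tie.
        if best is None or confidence > best[2]:
            best = (name, page_type, confidence)
    return best
```

With a barcode result of 40 and image and search results of 80 each, this sketch would pick the image result, because Image precedes Search in the assumed order.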

The properties of the DA plugin are described below:

1.      DA Merge Unknown Document Switch

This feature merges UNKNOWN documents into a classified document. The following example shows how it works.

For example: suppose that after the algorithm has run, the documents are classified as listed below:

DOC1 – UNKNOWN

DOC2 – UNKNOWN

DOC3 – DOC_TYPE_1

DOC4 – UNKNOWN

DOC5 – UNKNOWN

DOC6 – DOC_TYPE_2

DOC7 – UNKNOWN

DOC8 – DOC_TYPE_2

After this feature has executed, the results of the processing will be as follows:

DOC1 – UNKNOWN

DOC2 – UNKNOWN

DOC3 – DOC_TYPE_1

DOC4 – DOC_TYPE_2

DOC5 – DOC_TYPE_2

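The original DOC4 and DOC5 were merged into DOC3, and similarly DOC7 was merged into DOC6. DOC1 and DOC2 remain UNKNOWN because no classified document precedes them. The documents are renumbered after the merge, so the original DOC6 and DOC8 appear in the result as DOC4 and DOC5.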
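The merge behaviour in this example can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the plugin's implementation: documents are modelled as hypothetical (type, pages) pairs, and an UNKNOWN document is merged only when the document kept just before it is classified.

```python
from typing import List, Tuple

Document = Tuple[str, List[str]]  # (document type, page identifiers)


def merge_unknown(docs: List[Document]) -> List[Document]:
    """Merge each UNKNOWN document into the preceding classified document."""
    merged: List[Document] = []
    for doc_type, pages in docs:
        previous_is_classified = bool(merged) and merged[-1][0] != "UNKNOWN"
        if doc_type == "UNKNOWN" and previous_is_classified:
            # Append the pages to the last classified document.
            merged[-1][1].extend(pages)
        else:
            # Classified documents, and UNKNOWN documents with no
            # classified predecessor, are kept as separate documents.
            merged.append((doc_type, list(pages)))
    return merged


example = [
    ("UNKNOWN", ["P1"]), ("UNKNOWN", ["P2"]), ("DOC_TYPE_1", ["P3"]),
    ("UNKNOWN", ["P4"]), ("UNKNOWN", ["P5"]), ("DOC_TYPE_2", ["P6"]),
    ("UNKNOWN", ["P7"]), ("DOC_TYPE_2", ["P8"]),
]
result = merge_unknown(example)
```

Applied to the eight documents of the example, the sketch yields five documents whose types match the result list above.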

2.      Predefined Document Change

This feature sets a particular document type for all documents whose confidence value is less than the provided threshold value. Documents of type UNKNOWN are left as they are and are not affected.

For example: suppose that after the algorithm has run, the documents are classified with the confidence values shown below, the threshold value is 50 and the document type to be set is DOC_TYPE_DEFAULT:

DOC1 – DOC_TYPE_1 : 40

DOC2 – UNKNOWN : 0

DOC3 – DOC_TYPE_1 : 30

DOC4 – DOC_TYPE_4 : 70

DOC6 – DOC_TYPE_DEFAULT : 85

DOC7 – DOC_TYPE_DEFAULT : 30

Then the resulting classification for this feature will be as follows:

DOC1 – DOC_TYPE_DEFAULT

DOC2 – UNKNOWN

DOC3 – DOC_TYPE_DEFAULT

DOC4 – DOC_TYPE_4

DOC6 – DOC_TYPE_DEFAULT

DOC7 – DOC_TYPE_DEFAULT
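The rule in this example can be written as a short Python sketch. The function and parameter names are hypothetical and chosen only for illustration; the comparison is a strict "less than", as stated in the description above.

```python
from typing import List, Tuple


def apply_predefined_type(
    docs: List[Tuple[str, float]], threshold: float, default_type: str
) -> List[str]:
    """Return the document types after the Predefined Document Change rule.

    `docs` holds (document type, confidence) pairs.
    """
    result = []
    for doc_type, confidence in docs:
        # UNKNOWN documents are never touched by this feature.
        if doc_type != "UNKNOWN" and confidence < threshold:
            doc_type = default_type
        result.append(doc_type)
    return result


before = [
    ("DOC_TYPE_1", 40), ("UNKNOWN", 0), ("DOC_TYPE_1", 30),
    ("DOC_TYPE_4", 70), ("DOC_TYPE_DEFAULT", 85), ("DOC_TYPE_DEFAULT", 30),
]
after = apply_predefined_type(before, threshold=50, default_type="DOC_TYPE_DEFAULT")
```

The `before` list mirrors the six documents of the example, in the same order.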

 

3.     Change Unknown Document Type

This feature sets a particular document type for all documents of type UNKNOWN. The user provides, on the UI, a document type that exists in the batch class, and all UNKNOWN documents are classified as that document type.

For example: suppose that after the algorithm has run, the documents are classified as follows, and the document type to be set is DOC_TYPE_DEFAULT:

DOC1 – UNKNOWN

DOC2 – UNKNOWN

DOC3 – DOC_TYPE_1

DOC4 – UNKNOWN

DOC5 – UNKNOWN

DOC6 – DOC_TYPE_2

DOC7 – UNKNOWN

DOC8 – DOC_TYPE_2

Then the resulting classification for this feature will be as follows:

DOC1 – DOC_TYPE_DEFAULT

DOC2 – DOC_TYPE_DEFAULT

DOC3 – DOC_TYPE_1

DOC4 – DOC_TYPE_DEFAULT

DOC5 – DOC_TYPE_DEFAULT

DOC6 – DOC_TYPE_2

DOC7 – DOC_TYPE_DEFAULT

DOC8 – DOC_TYPE_2
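A minimal Python sketch of this feature is shown below. The names are hypothetical, and the explicit check that the target type exists in the batch class is an assumption added here to reflect the requirement stated above; the real plugin may enforce it differently (for example through the UI).

```python
from typing import Iterable, List


def change_unknown(
    doc_types: List[str], target_type: str, batch_class_types: Iterable[str]
) -> List[str]:
    """Replace every UNKNOWN document type with `target_type`."""
    # Assumed validation: the target must be a type of the batch class.
    if target_type not in set(batch_class_types):
        raise ValueError(f"{target_type!r} is not defined in the batch class")
    return [target_type if t == "UNKNOWN" else t for t in doc_types]


before = ["UNKNOWN", "UNKNOWN", "DOC_TYPE_1", "UNKNOWN",
          "UNKNOWN", "DOC_TYPE_2", "UNKNOWN", "DOC_TYPE_2"]
after = change_unknown(
    before, "DOC_TYPE_DEFAULT",
    ["DOC_TYPE_1", "DOC_TYPE_2", "DOC_TYPE_DEFAULT"],
)
```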

 

4.      DA Delete Document First Page

This feature deletes the first page of every classified document. It is used to remove separator sheets or barcode pages, if any, from a batch. If the switch is ON, the first page of each document is removed. If a document has only one page, that page is not removed.
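The behaviour can be sketched in a few lines of Python. This is an illustrative sketch only; documents are modelled as plain lists of hypothetical page identifiers.

```python
from typing import List


def delete_first_pages(docs: List[List[str]], switch_on: bool) -> List[List[str]]:
    """Drop the first page of each document when the switch is ON.

    Single-page documents keep their only page.
    """
    if not switch_on:
        return [list(pages) for pages in docs]
    return [pages[1:] if len(pages) > 1 else list(pages) for pages in docs]


docs = [["SEPARATOR", "PG1", "PG2"], ["PG3"]]
trimmed = delete_first_pages(docs, switch_on=True)
```

Here the separator sheet of the first document is removed, while the second document keeps its single page.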

 

5.      Regex Classification

The functionality of this feature is as follows:

When the user performs classification through “Barcode Classification” or uses the results of the “KV_PAGE_PROCESS” plugin, the application checks all the page level field values of each document classified as “UNKNOWN” against the configured regex.

For each page whose page level field has a value matching the regular expression, the DA plugin creates a new document of the type given in the “Default Regex Document Type” property.

In this way the DA plugin creates a new document for each page that has a value matching the regex provided.

The following example illustrates this:

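The regex provided is as follows: a* (read in this example as “any value beginning with the letter a, ignoring case”; as a strict regular expression that intent would be written as, for example, (?i)a.*)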

Default document type provided is: Regex Doc Type

Documents are classified as listed below before execution of this feature:

1.      DOC1 – type: Doc Type 1

2.      DOC2 – type: Doc Type 2

3.      DOC3 – type: UNKNOWN

4.      DOC4 – type: Doc Type 1

In this scenario the DA plugin works only on DOC3, since it is the only document of “UNKNOWN” type.

Assume that documents DOC1 and DOC2 hold pages PG0, PG1 and PG2 between them, and that DOC3 has four pages with the page level field values described below:

Page | Page level field i | Page level field ii | Page level field iii
PG3 | 123 | Invoice | US
PG4 | 990 | Invoice | Checklist
PG5 | 789 | Document | Aaa
PG6 | 456 | Application | Checklist

 

Of the pages PG3, PG4, PG5 and PG6, neither PG3 nor PG4 has a page level field that matches the regex provided (a*); these pages therefore remain in document DOC3 with type “UNKNOWN”.

In PG5 the third page level field (Aaa) matches the regex, so this page is put into a new document DOC4 of type “Regex Doc Type”. In PG6 the second page level field (Application) matches the regex, so this page is also put into a new document, with identifier DOC5 and type “Regex Doc Type”.

The final classified documents will be as follows (the original DOC4 is renumbered to DOC6):

1.      DOC1 – type: Doc Type 1

2.      DOC2 – type: Doc Type 2

3.      DOC3 – type: UNKNOWN (pages: PG3, PG4)

4.      DOC4 – type: Regex Doc Type (pages: PG5)

5.      DOC5 – type: Regex Doc Type (pages: PG6)

6.      DOC6 – type: Doc Type 1
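The page-by-page split in this example can be reproduced with the Python sketch below. It is a hedged illustration, not the plugin's code: the function name and data layout are hypothetical, and the pattern a* from the example is interpreted as “begins with a, ignoring case” and written as `a.*` with the ignore-case flag. That interpretation is an assumption chosen to reproduce the outcome described above.

```python
import re
from typing import List, Pattern, Tuple


def split_by_regex(
    pages: List[Tuple[str, List[str]]], pattern: Pattern[str], regex_doc_type: str
) -> Tuple[List[str], List[Tuple[str, List[str]]]]:
    """Split the pages of one UNKNOWN document.

    Returns (pages left in the UNKNOWN document, new single-page documents).
    """
    remaining: List[str] = []
    new_docs: List[Tuple[str, List[str]]] = []
    for page_id, values in pages:
        # A page matches when any of its page level field values matches.
        if any(pattern.fullmatch(value) for value in values):
            new_docs.append((regex_doc_type, [page_id]))
        else:
            remaining.append(page_id)
    return remaining, new_docs


# Assumed reading of the example pattern "a*".
starts_with_a = re.compile(r"a.*", re.IGNORECASE)
doc3_pages = [
    ("PG3", ["123", "Invoice", "US"]),
    ("PG4", ["990", "Invoice", "Checklist"]),
    ("PG5", ["789", "Document", "Aaa"]),
    ("PG6", ["456", "Application", "Checklist"]),
]
remaining, new_docs = split_by_regex(doc3_pages, starts_with_a, "Regex Doc Type")
```

With the four pages of DOC3, PG3 and PG4 stay in the UNKNOWN document, and PG5 and PG6 each become a document of type “Regex Doc Type”.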

 

Advanced Document Assembler Algorithm

A detailed explanation of the Advanced Document Assembler algorithm, with use cases, is given below:
1.      First Page:

When the algorithm encounters the first FIRST_PAGE, it creates a new document and adds this page to it. For each page encountered after that, the algorithm checks the page type and, irrespective of that type, merges the page into the open document if it finds matching alternate values.

Test cases:

Scenario 1:

If the algorithm has three pages as input and they are all classified as follows:

1.  A_FIRST_PAGE

2.  A_FIRST_PAGE

3.  A_FIRST_PAGE

Three individual documents are created.

Scenario 2:

If the algorithm has three pages as input and they are classified as follows:

1.  A_FIRST_PAGE

2.  B_FIRST_PAGE

3.  C_FIRST_PAGE

  • On encountering A_FIRST_PAGE, the algorithm creates a new document.
  • For B_FIRST_PAGE, the algorithm checks the alternate values of that page for A_MIDDLE_PAGE and A_LAST_PAGE and chooses whichever has the higher confidence. If that confidence passes the confidence threshold check for the respective page type, the page is merged into the open document; otherwise the first document is closed and a new document is created.
  • For C_FIRST_PAGE the same procedure is followed.
  • In addition, if the confidence of the current page classified as X_FIRST_PAGE is greater than or equal to the First Page Confidence Threshold (F_P_C_T), a new document is created and the current page is placed in it.
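The first-page rule from the last bullet can be sketched as a small Python predicate. This is one possible reading of the text, not the actual implementation: the function name is hypothetical, and the ordering of the checks (F_P_C_T first, alternate values afterwards) is an assumption.

```python
def first_page_opens_new_document(
    page_type: str,
    confidence: float,
    first_page_threshold: float,
    has_open_document: bool,
) -> bool:
    """Return True when the page must start a new document outright.

    A False result means the page is still a candidate for merging, which
    would then be decided from its alternate values.
    """
    if not has_open_document:
        # Nothing to merge into, so a new document is always created.
        return True
    is_first_page = page_type.endswith("_FIRST_PAGE")
    # F_P_C_T rule: a sufficiently confident first page starts a new document.
    return is_first_page and confidence >= first_page_threshold
```

For example, with F_P_C_T set to 75, a B_FIRST_PAGE with confidence 80 would start a new document, while one with confidence 60 would go on to the alternate-value check.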

2.      Last Page:

In the case of last pages, the algorithm merges them if the confidence threshold matching algorithm succeeds for the merge scenario.

Test Cases:

Scenario 1:

If the algorithm has three pages as input and they are all classified as follows:

1.  A_LAST_PAGE

2. A_LAST_PAGE

3.  A_LAST_PAGE

  • Suppose there are no open documents. On encountering the first A_LAST_PAGE, the algorithm puts it in a new document.
  • The second page has the same type, A_LAST_PAGE, so the algorithm merges it into the open document.
  • Similarly, the algorithm merges the third page into the document and closes the document.

Scenario 2:

If the algorithm has three pages as input and they are classified as follows:

1. A_LAST_PAGE

2. B_LAST_PAGE

3.  C_LAST_PAGE

  • The first page is put in a new document. The next page picked up is B_LAST_PAGE. Since the two pages are of different types, the algorithm looks into the top five alternate values of this page and gets the confidence for A_LAST_PAGE (because the document already has a last page, the middle page confidence in the alternate values is not considered), then processes it with the confidence threshold matching algorithm. If the result is positive for a merge, the two pages are merged into one document. If it is negative, the first document is closed and a new document is created with page B_LAST_PAGE placed in it.

The algorithm follows the same procedure for C_LAST_PAGE.

  • The procedure is the same for all other scenarios and combinations, such as A_LAST_PAGE, B_LAST_PAGE, A_LAST_PAGE.

3.      Middle Page:

For middle pages, the algorithm follows the same procedure for merging pages into the same document. From the alternate values, however, it considers both middle pages and last pages and takes the higher confidence, until a last page has been added to the open document.

The merge decision is made in the same way, by the confidence threshold matching algorithm.
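How a candidate is chosen from the alternate values can be sketched as follows. This Python example is an interpretation of the description, not the plugin's code: the function name, the (page type, confidence) layout and the page type naming scheme `<DOC>_MIDDLE_PAGE` / `<DOC>_LAST_PAGE` are assumptions based on the scenarios above.

```python
from typing import List, Optional, Tuple


def pick_candidate(
    alternates: List[Tuple[str, float]],
    open_doc_type: str,
    open_doc_has_last_page: bool,
    top_n: int = 5,
) -> Optional[Tuple[str, float]]:
    """Pick the alternate value to compare against for a merge.

    Only the top `top_n` alternates by confidence are inspected. Middle
    pages are considered only while the open document has no last page.
    """
    wanted = {f"{open_doc_type}_LAST_PAGE"}
    if not open_doc_has_last_page:
        wanted.add(f"{open_doc_type}_MIDDLE_PAGE")
    top = sorted(alternates, key=lambda item: item[1], reverse=True)[:top_n]
    candidates = [item for item in top if item[0] in wanted]
    if not candidates:
        return None
    # Take whichever matching alternate has the higher confidence.
    return max(candidates, key=lambda item: item[1])


alternates = [
    ("B_FIRST_PAGE", 90), ("A_MIDDLE_PAGE", 70),
    ("A_LAST_PAGE", 60), ("C_FIRST_PAGE", 10),
]
```

With these sample values and an open document of type A, the middle page alternate is chosen while the document has no last page, and the last page alternate once it has one.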

This completes the analysis of the Advanced Document Assembler algorithm.

 

Confidence Threshold Matching Algorithm

This algorithm decides whether pages are to be merged or not.

Let the confidence of a page be “X”. If the page type the algorithm is trying to match is a middle page with confidence “Y”, the algorithm checks whether

(X-Y) < M_P_C_T

If this is true, the algorithm merges the page into the open document. Otherwise it does not merge: it closes the open document and creates a new document with this page in it.

Similarly, when matching last pages, if the confidence of the last page from the alternate values is “Y”, the algorithm checks whether

(X-Y) < L_P_C_T

If this is true the algorithm merges the pages; otherwise it does not merge them.
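The two checks above share one form, which the Python sketch below makes explicit. The function name and the sample threshold values are hypothetical; only the inequality (X-Y) < threshold comes from the text.

```python
def should_merge(page_confidence: float, alternate_confidence: float,
                 threshold: float) -> bool:
    """Confidence threshold check: merge when (X - Y) < threshold.

    X is the confidence of the page as classified, Y the confidence of the
    matching middle or last page found in its alternate values.
    """
    return (page_confidence - alternate_confidence) < threshold


# Hypothetical threshold values, used only for this illustration.
M_P_C_T = 25  # Middle_Page_Confidence_Threshold
L_P_C_T = 15  # Last_Page_Confidence_Threshold

merge_as_middle = should_merge(90, 70, M_P_C_T)  # 20 < 25
merge_as_last = should_merge(90, 70, L_P_C_T)    # 20 < 15 is false
```

Because the comparison is strict, a difference exactly equal to the threshold does not lead to a merge.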

Assumptions

The algorithm makes the following assumptions and has the following requirements:

  • For alternate values, the algorithm considers only the top 5 alternate values (where 5 is the value configured at the admin UI).
  • The algorithm has three confidence thresholds because the three types of pages have different weightages:

1.      First_Page_Confidence_Threshold (F_P_C_T)

2.      Middle_Page_Confidence_Threshold (M_P_C_T)

3.      Last_Page_Confidence_Threshold (L_P_C_T)

 

DA Plugin features

The plugin also has some new properties that extend it and reduce the overhead of script execution. Because fewer scripts are required, the user spends less effort creating scripts and the time delay of executing them is avoided. The table below shows the newly added properties and their respective values.

S. No | DA Plugin feature | Default value | Execution
1 | DA Merge Unknown Document Switch | OFF | Search, Image and Automatic Classification
2 | Advanced Document Assembler Algorithm | ON | Search and Image Classification
3 | Predefined Document Change | NA | Search and Image Classification
4 | Change Unknown Document Type | NA | ALL
5 | Regex Classification | NA | Barcode Classification
6 | DA Delete Document First Page Switch | OFF | ALL

UI Configuration

The Document Assembler plugin can be configured through the following properties in the UI:

Configurable property | Value options | Description
DA Barcode confidence | 0-100 | Specifies the barcode confidence. This value is the confidence assigned to the classified document type in Barcode classification.
DA Rule First-middle-last Page | 0-100 | Specifies the confidence for a first, middle and last page document.
DA Rule First Page | 0-100 | Specifies the confidence for a first page document.
DA Rule Middle Page | 0-100 | Specifies the confidence for a middle page document.
DA Rule Last Page | 0-100 | Specifies the confidence for a last page document.
DA Rule First-last Page | 0-100 | Specifies the confidence for a first and last page document.
DA Rule First-middle Page | 0-100 | Specifies the confidence for a first and middle page document.
DA Rule Middle-last Page | 0-100 | Specifies the confidence for a middle and last page document.
DA Classification Type | SearchClassification, BarcodeClassification, ImageClassification, OneDocumentClassification, AutomaticClassification | Decides the document classification type to be used for classification.
DA Merge Unknown Document Switch | ON, OFF | Decides whether unknown documents are merged with the previously classified document or not.
DA Delete Document First Page Switch | ON, OFF | Decides whether the first page of a document is deleted if the document has more than one page.
Advanced DA Switch | ON, OFF | Decides whether to run the old or the Advanced DA algorithm.
DA First Page Confidence Threshold | 0-100 | Used in Advanced DA to specify the confidence threshold for a first page for classification into a document.
DA Middle Page Confidence Threshold | 0-100 | Used in Advanced DA to specify the confidence threshold for a middle page for classification into a document.
DA Last Page Confidence Threshold | 0-100 | Used in Advanced DA to specify the confidence threshold for a last page for classification into a document.
Predefined Document Type | Pre-defined Document Type | Specifies the Document Type to which a document is changed, based on the Predefined Document Confidence Threshold.
Predefined Document Confidence Threshold | 0-100 | Specifies the confidence threshold below which a document is reclassified as the Predefined Document Type.
Change Unknown Document Type Switch | ON, OFF | Decides whether documents classified as Unknown are changed to a pre-defined Document Type.
Change Unknown Document To Document Type | Pre-defined Document Type | Specifies the Document Type to which a document is changed, based on the Change Unknown Document Type Switch.
Regex Classification Switch | ON, OFF | Decides whether KV Page Process and Barcode Reader results are compared with the Regex Classification Pattern. If there is a match, the Document Type is changed to the one specified in Regex Classification Default Document Type.
Regex Classification Pattern | Regex pattern | Specifies the regex pattern with which the value from KV Page Process or Barcode Reader is compared.
Regex Classification Default Document Type | Pre-defined Document Type | Specifies the Document Type to which a document is changed, based on the Regex Classification Switch.

Steps of execution

    • This plugin works in the document assembler phase of the Batch Class workflow.
    • The plugin uses the page level field results of the Page Processing module as input and generates the classified documents as output.
    • If DA Merge Unknown Document Switch is ON, the plugin merges the “Unknown” type documents into the previous classified documents.

Dependency

The plugin assumes that the page processing plugins for the respective classification types have been executed and that the page level fields for each image are populated. The Document Assembler works on the page level field values of each page and classifies pages into documents.

Troubleshooting

The following are a few common error messages caused by malfunctioning of the plugin:

S no. | Error message | Possible root cause
1 | Invalid format of page level fields. DocFieldType found for {Document Assembler Classification Type} classification is null. | Page level fields weren’t present on the batch, or the Page processing module didn’t work properly.
2 | DocumentType name is not found in the data base for the page type name | Barcode decoded value is not found as a document type in the Ephesoft Application database.
3 | No Document type defined for batch instance | Batch class doesn’t have a document type for classification.
4 | Invalid integer for barcode confidence score in properties file. | Invalid value used for “DA Barcode confidence” in the Ephesoft Admin Screen Configuration.

 

 

<Back| 4.0.0.0 Release Documentation
