Fuzzy DB Design Revamp

The Fuzzy DB plugin searches and extracts data from any configured database. Extraction can be performed at both the Document level and the Field level, and a search can use the value of a document level field that was previously extracted by another plugin. The plugin performs a lookup based on the index fields mapped to the database columns and populates the related data in the fields.

In the previous design, all configurations were done at the document level, but the mappings were stored in the database at the plugin level, which caused several issues:

  • Mappings were not copied when a document type was copied.
  • Mappings were not imported/exported when a batch class or document type was imported/exported.
  • On removing the Fuzzy DB plugin from the workflow, all mappings were removed.
  • During upgrade from Batch Class level to Document level, the properties did not migrate properly.

To resolve these issues, the Fuzzy DB design has been completely revamped. All Fuzzy DB configurations are now moved from the Plugin level to the Document level. In addition to document level search and extraction, the new design also supports field level search and extraction.


Both Document and Field Fuzzy can be configured for a Document Type.

Configuring Database

  1. Go to System Configuration from the Admin section.
    F:\Enterprise\Product documentation 4060\images\fuzzy\1.png
  2. Go to Connection Manager, and click on Add.
    Enter the Connection and Database details to configure the database.
    F:\Enterprise\Product documentation 4060\images\fuzzy\2.png
  3. You can configure multiple databases here.

Document Fuzzy (Document level extraction)

In Document Fuzzy, only one database can be configured per Document type.

If no Search Field is selected, an OCR content search is performed.

If a Search Field is selected, the value extracted by a previous plugin for the selected index field acts as the search string for fetching values.

If no match is found in the Lucene indexes for the value extracted from the selected field, whether an OCR content search is performed depends on the value of the HOCR Search Switch.

Document Fuzzy works with or without any preceding plugin configuration.
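The search order described above can be sketched as follows. This is only an illustration of the decision flow, not the plugin's code: the function document_fuzzy_lookup, its parameters, and the toy_index helper are hypothetical names, and the real lookup runs against Lucene indexes rather than a Python list.

```python
from typing import Callable, List, Optional


def document_fuzzy_lookup(
    search_field_value: Optional[str],
    ocr_content: str,
    hocr_search_switch: bool,
    search_index: Callable[[str], List[dict]],
) -> List[dict]:
    """Return matches following the search order described in the text."""
    # No Search Field selected: search with the OCR content.
    if search_field_value is None:
        return search_index(ocr_content)

    # Search Field selected: the previously extracted value is the search string.
    matches = search_index(search_field_value)
    if matches:
        return matches

    # No match for the field value: fall back to OCR content
    # only when the HOCR Search Switch is enabled.
    if hocr_search_switch:
        return search_index(ocr_content)
    return []


def toy_index(query: str) -> List[dict]:
    """Hypothetical stand-in for the index: matches rows whose state appears in the query."""
    rows = [{"state": "CA"}]
    words = set(query.upper().split())
    return [row for row in rows if row["state"] in words]


print(document_fuzzy_lookup("NY", "ship to CA", True, toy_index))
```

With the switch disabled, the same call returns an empty list because the field value NY has no match and no fallback is attempted.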

  1. In the Batch Class view, select the Document Type.
  2. Expand Fuzzy DB Extraction Configuration and click on Document Fuzzy.
  3. Select a Connection, Database Table Name and Primary Key from the dropdown lists.
  4. Click on Add to map index fields with database columns.
    Select the Is Searchable checkbox if you want the database lookup to be based on the configured field. The search is then applied only to the mapped column instead of the entire table.
  5. Perform Additional Mappings and click on Apply.
    F:\Enterprise\Product documentation 4060\images\fuzzy\3.png
    Basic Configuration:
    Enabled – Enables the Document Fuzzy configuration.

    Confidence Threshold – The minimum confidence defined for the search.

    Weight – A value from 0 to 1 that acts as a multiplying factor for the computed confidence.

    Include Pages – Select First Page, Last Page or All Pages (HOCR content) for the search.

    Ignore Word List – Enter semicolon-separated values that you want to ignore during the search.


    Field Based Search:

    HOCR Search Switch – Enable this if you want to search based on HOCR content after the index field search.

    Search Column List – Select the index field to search with.

    Once all the configurations are done, click on Learn Database to generate the Lucene indexes.
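As a rough illustration of how Weight and Confidence Threshold interact, the sketch below multiplies a computed confidence by the weight and compares the result with the threshold. The function names are hypothetical, and the text does not state whether the threshold is checked before or after the weight is applied; comparing the weighted value is an assumption made here.

```python
def weighted_confidence(raw_confidence: float, weight: float) -> float:
    """Apply Weight (0 to 1) as a multiplying factor to the computed confidence."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return raw_confidence * weight


def passes_threshold(raw_confidence: float, weight: float, threshold: float) -> bool:
    # Assumption: the threshold is compared against the weighted value.
    return weighted_confidence(raw_confidence, weight) >= threshold


print(weighted_confidence(80.0, 0.5))
print(passes_threshold(80.0, 0.5, 50.0))
```

Under this assumption, a weight below 1 lowers the effective confidence of every result, so fewer results pass the same threshold.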

 

Test Extraction

You can test the extraction results on the Document Type grid itself on a PDF or TIFF image.

In the illustration below, the value of the Invoice field is extracted by both KV Extraction and Fuzzy DB. The KV Extraction confidence was 100, so its value is populated. For State, only Fuzzy DB Extraction was configured, so the value is extracted from Fuzzy DB with a confidence of 46.76.

F:\Enterprise\Product documentation 4060\images\fuzzy\4.png
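The behaviour in this illustration can be sketched as picking the candidate with the highest confidence. This is a minimal sketch, not the product's code: the name populate_field, the field values, and the Fuzzy DB confidence for Invoice (60.0) are hypothetical; only the confidences 100 and 46.76 come from the example above.

```python
from typing import List, Optional, Tuple

# (extraction name, extracted value, confidence)
Candidate = Tuple[str, str, float]


def populate_field(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the candidate with the highest confidence, or None if nothing was extracted."""
    if not candidates:
        return None
    return max(candidates, key=lambda candidate: candidate[2])


# Hypothetical values; the Fuzzy DB confidence for Invoice is assumed.
invoice = [("KV Extraction", "INV-001", 100.0), ("Fuzzy DB", "INV-001", 60.0)]
state = [("Fuzzy DB", "CA", 46.76)]

print(populate_field(invoice))
print(populate_field(state))
```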


Fuzzy Search and Extraction

When the batch is executed, the document appears on the Validation Screen, where the user can search using the HOCR content mapped to the index field.

In the illustration below, the user searched the database for the field value CA from the document. The extraction results are shown in a popup window. The user can select a row and click on OK to populate the details in the field.

If you want to fetch all the records from the table, enter * in the search box.

F:\Enterprise\Product documentation 4060\images\fuzzy\5.png

Field Fuzzy (Field level Extraction)

Multiple rules can be defined; a Group is created for each rule in the configurations.

A field can be mapped with multiple databases.

Group: Connection: Table: Fields

Note that it is not possible to have two connections mapped to the same table and index.
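The Group: Connection: Table: Fields hierarchy and the restriction above can be pictured with a small data structure. This is a sketch under assumptions: the class FuzzyGroup, the function find_conflicts, and all connection, table and column names are hypothetical, and the restriction is read here as "the same table and index field pair cannot appear under two different connections".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FuzzyGroup:
    name: str
    connection: str
    table: str
    # index field name -> database column name
    field_mappings: Dict[str, str] = field(default_factory=dict)


def find_conflicts(groups: List[FuzzyGroup]) -> List[Tuple[str, str]]:
    """Return (table, index field) pairs that are mapped under more than one connection."""
    first_connection: Dict[Tuple[str, str], str] = {}
    conflicts: List[Tuple[str, str]] = []
    for group in groups:
        for index_field in group.field_mappings:
            key = (group.table, index_field)
            # Remember the first connection seen for this pair.
            known = first_connection.setdefault(key, group.connection)
            if known != group.connection and key not in conflicts:
                conflicts.append(key)
    return conflicts


groups = [
    FuzzyGroup("Group1", "conn_a", "customers", {"f3": "name"}),
    FuzzyGroup("Group2", "conn_b", "customers", {"f3": "name"}),
]
print(find_conflicts(groups))
```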

Assumption – Connections (Databases) are already configured.

Another plugin is also configured on the Document Type.

Configure Field Fuzzy:

  1. In the Batch Class view, select the Document Type.
  2. Expand Fuzzy DB Extraction Configuration and click on Field Fuzzy.
  3. Create a Group and select Connection, Database Table Name and Primary Key from the dropdown lists.
  4. Click on Add to map index fields with database columns.
    Select the Is Searchable checkbox if you want the database lookup to be based on the configured field. The search is then applied only to the mapped column instead of the entire table.
  5. Perform Additional Mappings and click on Apply Mappings.
    F:\Enterprise\Product documentation 4060\images\fuzzy\6.png
    Additional Parameters Mapping:
    Enabled – Enables the Field Fuzzy configuration.

    Confidence Threshold – The minimum confidence defined for the search.

    Weight – A value from 0 to 1 that acts as a multiplying factor for the computed confidence.

    Ignore Word List – Enter semicolon-separated values that you want to ignore during the search.

  6. Configure multiple Groups (rules) as per your requirement.
    Enable the rules that you want to apply to the Document.
    F:\Enterprise\Product documentation 4060\images\fuzzy\7.png
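The Ignore Word List format can be illustrated with a short sketch that splits the semicolon-separated value and drops the listed words from a search string. The function names are hypothetical, and case-insensitive whole-word matching is an assumption; the text does not specify how the plugin compares words.

```python
from typing import Set


def parse_ignore_list(raw: str) -> Set[str]:
    """Split a semicolon-separated Ignore Word List, skipping empty entries."""
    return {word.strip().lower() for word in raw.split(";") if word.strip()}


def strip_ignored_words(text: str, raw_ignore_list: str) -> str:
    # Assumption: whole words are compared case-insensitively.
    ignored = parse_ignore_list(raw_ignore_list)
    return " ".join(word for word in text.split() if word.lower() not in ignored)


print(strip_ignored_words("Acme Inc Holdings Ltd", "Inc;Ltd"))
```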

 

Test Extraction

You can test the extraction results on the Document Type grid itself on a PDF or TIFF image.

In the illustration below, the value of field f2 is extracted by both Regex Extraction and Fuzzy DB. The Regex Extraction confidence was higher (85), so its value is populated. For the other fields, only Fuzzy DB Extraction was configured, so their values are extracted from Fuzzy DB. Field f5 was not mapped, so no result is extracted for it.

F:\Enterprise\Product documentation 4060\images\fuzzy\8.png

 

Fuzzy Search and Extraction

When the batch is executed, the document appears on the Validation Screen, where the user can search at the index field level.

In the illustration below, the user searched the database for the value multipage in field f3. The extraction results are shown in a popup window. Here, one record was found in Group1 and two records in Group2. The user can select a row and click on OK to populate the details in the field.

If you want to fetch all the records from the table, enter * in the search box.

F:\Enterprise\Product documentation 4060\images\fuzzy\9.png

The user can select multiple rows and populate the extracted values in the fields.

Note that two rows from the same Group cannot be selected.

Case 1: If multiple rows are selected, the application picks the record with the highest confidence and populates its values in the field.

F:\Enterprise\Product documentation 4060\images\fuzzy\10.png

Case 2: If multiple rows are selected with the same confidence, the application picks the values of the record with the higher Weight.

F:\Enterprise\Product documentation 4060\images\fuzzy\11.png

Case 3: If both confidence and weight are the same for the selected rows, the application picks on a first come, first served basis, i.e. the first record in the extraction list.
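Cases 1 to 3, together with the one-row-per-Group rule, can be summarised in a short sketch. The class SelectedRow and the function pick_row are hypothetical names, and the values are made up for illustration. The sketch relies on Python's max() returning the first maximal element, which gives the first-record behaviour of Case 3.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SelectedRow:
    group: str
    value: str
    confidence: float
    weight: float


def pick_row(rows: List[SelectedRow]) -> SelectedRow:
    """Pick the row whose values are populated, following Cases 1 to 3."""
    if not rows:
        raise ValueError("no rows selected")
    groups = [row.group for row in rows]
    if len(groups) != len(set(groups)):
        raise ValueError("two rows from the same Group cannot be selected")
    # Highest confidence first, then higher weight; on a full tie
    # max() keeps the earliest row in the list.
    return max(rows, key=lambda row: (row.confidence, row.weight))


rows = [
    SelectedRow("Group1", "first", 80.0, 0.5),
    SelectedRow("Group2", "second", 80.0, 0.9),
]
print(pick_row(rows).value)
```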

 

The extracted values are populated in the fields.

F:\Enterprise\Product documentation 4060\images\fuzzy\12.png

Case 4: If both Document Fuzzy and Field Fuzzy are configured and the same index field exists in both, the value is populated on the basis of the higher weighted confidence.

Note:

In Batch.xml, new tags are added inside the <DocumentLevelField> tag to indicate the extraction type, i.e. Document Fuzzy or Field Fuzzy.

Case 1: Data extracted via Document Fuzzy

<DocumentLevelField>
  <ExtractionName>Fuzzy DB</ExtractionName>
  <FuzzyExtractionType>
    <Type>Document</Type>
  </FuzzyExtractionType>
  <AlternateValues>
  </AlternateValues>
</DocumentLevelField>

Case 2: Data extracted via Field Fuzzy

<DocumentLevelField>
  <ExtractionName>Fuzzy DB</ExtractionName>
  <FuzzyExtractionType>
    <Type>Group</Type>
    <GroupName>group1</GroupName>
  </FuzzyExtractionType>
  <AlternateValues>
  </AlternateValues>
</DocumentLevelField>

Case 3: Data extracted via another plugin (KV Extraction)

<DocumentLevelField>
  <ExtractionName>KV Extraction</ExtractionName>
  <AlternateValues>
  </AlternateValues>
</DocumentLevelField>
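A downstream script could read these tags to tell how a field was extracted. The sketch below uses Python's standard xml.etree.ElementTree on fragments shaped like the three cases above. The function describe_extraction is a hypothetical name, and a real Batch.xml contains more elements (and possibly namespaces) than these simplified fragments, so treat this as an illustration only.

```python
import xml.etree.ElementTree as ET


def describe_extraction(field_xml: str) -> str:
    """Summarise the extraction type of one <DocumentLevelField> fragment."""
    element = ET.fromstring(field_xml)
    name = element.findtext("ExtractionName", default="")
    fuzzy = element.find("FuzzyExtractionType")
    if fuzzy is None:
        # Extracted by another plugin: no fuzzy tags are present.
        return name
    if fuzzy.findtext("Type", default="") == "Group":
        group = fuzzy.findtext("GroupName", default="")
        return f"{name} (Field Fuzzy, group {group})"
    return f"{name} (Document Fuzzy)"


sample = (
    "<DocumentLevelField>"
    "<ExtractionName>Fuzzy DB</ExtractionName>"
    "<FuzzyExtractionType><Type>Group</Type><GroupName>group1</GroupName></FuzzyExtractionType>"
    "<AlternateValues></AlternateValues>"
    "</DocumentLevelField>"
)
print(describe_extraction(sample))
```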

 

 
