Fuzzy DB Extraction Plugin:

FuzzyDB plugin is used to match incoming documents to records in the database.

Example 1: Insurance company is receiving faxes from its customers. Insurance Company already has a database with all the customers containing their name, address, phone number, etc. Ephesoft can be configured to analyse all the words in the incoming documents and find the matching database record automatically. No extraction is needed.

Example 2: Ephesoft administrator can configure a Accounting Vendor database to automatically capture Vedor name and Vendor ID from incoming invoices by simply mapping the document to database table and index fields fr the document to columns in the database table. Even if the Vendor info (such as name, address, phone number) on the invoices are partially OCRed, plugin will find teh matching vendor from the database and update the fields in the document.

This plug-in involves creation of search engine based indexing and extracting document level field value based on fuzzy match of hOCR content against index.

Ephesoft can also ignore certain words when searching for a FuzzyDB match, for example if your own company name is in your vendors database this may throw up false positives. To add words to your ignore list go to your \Ephesoft\Application\WEB-INF\classes\META-INF\dcma-fuzzydb\fuzzy-db.properties and you’ll find the fuzzydb.ignore_list= where you can add your words

 

General Info

 

Connection String (case sensitive)

  • For mssql (jtds driver) = jdbc:jtds:sqlserver://SERVERNAME_OR_IP;databaseName=ephesoft
Note: Jtds connection string is w.r.t Sql Server 2008 R2 but should also work for Sql Server 2005 as well. All our testing has been done on Sql Server 2008 R2.
  • For MySQL = jdbc:sqlserver://SERVERNAME_OR_IP;database= ephesoft
  • Ephesoft does have support for views in place of tables for fuzzy db. Testing for both Fuzzy Search as well as Fuzzy Extraction has been done.
  • for Oracle (Available in 3.0.3.7, 3.1.x+; if oracle database driver selection available) = jdbc:oracle:thin:@<Server-address>:<port-number>:<SID Name>;dbName=<Database-Name>

Tested/Certified Server Versions

  • MySQL 5.1/5.5
  • Up to MSSQL 2014

UPDATE

DATA TYPES

Java MySQL Limits MSSQL Limits
STRING VARCHAR, TEXT, CHAR VARCHAR:1-255 chars; TEXT: Maximum length of 65535 chars; CHAR:1-255 chars VARCHAR, TEXT, CHAR VARCHAR:8000 Max chars; TEXT: Max 2^31 – 1 chars; CHAR:8000 Max chars
DATE DATETIME, TIMESTAMP DATETIME
LONG BIGINT The signed range is –9223372036854775808 to 9223372036854775807. The unsigned range is 0 to 18446744073709551615 BIGINT Range is 2^63 through 2^63-1
DOUBLE DOUBLE Range is -1.7976931348623157E+308 to -2.2250738585072014E-308 DOUBLE Range from -3.40E + 38 through 3.40E + 38
INTEGER INT, MEDIUMINT INT
FLOAT FLOAT Ranges are –3.402823466E+38 to –1.175494351E-38, 0 and 1.175494351E-38 to 3.402823466E+38 FLOAT -1.79E + 308 through -2.23E – 308, 0 and 2.23E + 308 through 1.79E + 308
BIGDECIMAL DECIMAL The maximum range of DECIMAL values is the same as for DOUBLE, but the actual range for a given DECIMAL column may be constrained by the choice of Length and Decimals. DECIMAL -10^38 +1 through 10^38 –1
BOOLEAN BIT A value of zero is considered false. Non-zero values are considered true. BIT Integer data with either a 1 or 0 value, where 0 value is considered as false.

 

Steps

  1. User select the batch class in the admin module and navigates to Fuzzy plugin settings page and turn on the FuzzyDB Extraction Switch.
    FuzzyDB-Plugin-Config
  2. User reviews the configuration settings for Fuzzy DB Extraction Configuration under Document Types. User selects FuzzyDB connection, TableName and Row ID. RowID must be mapped to a unique key. Index level fields can be mapped with table columns.
    FuzzyDB-DB-mapping
  3. User creates indexes whenever he click on “learn” button. Note that index can be only generated once the mapping is defined.
    1. Lucene indexing is generated against all DB records belonging to all document types (against which DB table is already specified) for current batch class. User create indexes for each row for all mapped columns.
    2. Indexes are built on a string which is the combined text of all fields belonging to various columns of DB table i.e. customer name, address, state, telephone, etc.
    3. Separate index directories are created to store indexs for different document types. The hierarchy used for storing index files against each document level field is: Fuzzy_db_index/<batch_class_id>/<document_type> Index creation is typically a time consuming process, so a static progress bar (no dynamic ETA calculation) will be shown to user while index is being created at backend.

Preconditions/Assumptions

  1. Changes in revision of batch does not impact the configuration of this plugin. Changes is version of batch will require admin to create the configuration again for the plugin.
  2. All the mapping details of document level fields with database table columns are stored with the plugin details.
  3. If user add new document type in the batch and no plugin configuration is defined for the added batch class, no extraction will happen.
  4. If user deletes the existing document type and the mapping still exists, plugin will do nothing because no batch documents will be processed.
  5. If user adds/removes/edits the document level field then admin has to make appropriate adjustments to mappings.
  6. The plugin does not assume existence of any other key value extraction plugin, scripting or regular expression extraction plugin. It operates independently of other plugins.
  7. The index will only be generated on the columns that are mapped.
  8. We assume that one table maps to one document type. The database containing all tables for different document types is set as part of plugin settings

Execution phase

Steps

  1. Work-flow invokes the “fuzzy DB extraction” plug-in.
  2. Plug-in uses an hOCR file and query compares the hOCR based query against indexes on DB table rows.
  3. Lucene returns the matching record (the one with the highest confidence score) i.e row id. If the score is greater than the threshold then the corresponding values will be stored in document level fields values in batch xml file.
  4. Row id of DB table is returned by Lucene search engine, and will be used to fetch further content from DB. Data fetched from DB will be used to populate document level fields in batch.xml. All the document level fields populated from this plug-in will need to be validated in the validation stage of work-flow (as it is fuzzy matching)
  5. The matching is based on only those fields that were involved in mapping (and index creation).

Preconditions/Assumptions

  1. Lucene is used because if we use SQL query for every word in the html file, it would be too slow and furthermore Lucene will provide results even if the OCR is not perfect on every character in the word.
  2. It is possible that match might result in no result. In such cases, any document level field is not populated.
  3. It is possible that match might result in multiple results. In such cases, the one with the highest confidence score entry will be used to populate document level fields.
  4. The plug-in does not involve manual intervention and will be an automated step.

Inputs: Single page tiffs

Outputs: XML files, PDF, multi-page TIFFs

Was this article helpful to you?

wikiadmin

Comments are closed.