Overview

The ‘key-value pair’ (KV) extraction plug-in extracts document level index field values based on the location of a ‘value’ relative to a specified ‘key’.

Plug-in execution for a batch instance consists of the following steps:

  • The extraction plug-in iterates over all documents in the batch instance. For each document, it fetches the list of document level index fields for that document's ‘document type’. Each document level field can be associated with multiple extraction filters.
  • The pages of the document (the HOCR corresponding to each page) are parsed to build an in-memory matrix per page that holds every word and its coordinates (the exact structure of the matrix will be decided during detailed design). The matrix improves performance: it is generated once for all pages of a document and reused for key/value pattern matching for all document level fields of that document.
  • For every document level field, the ‘KEY’ regular expression is searched against the page-level in-memory matrices created in the previous step.
  • If the ‘KEY’ search returns one or more matched words, the ‘VALUE’ regular expression is evaluated against the words located at a specific position relative to the key (as governed by the LOCATION attribute of the extraction filter or by the value zone created on the KV Extraction screen). This is done for every occurrence of KEY in every page-level matrix.
  • The matches found for the VALUE regular expression (zero or more) are used to update batch.xml as the document level field value (and alternate values).
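
The steps above can be sketched roughly as follows. This is only an illustration, not the plug-in's actual implementation: the `Word` structure, the `is_at` geometry test and the `extract` function are hypothetical names, and only four of the eight locations are covered. It assumes that each page has already been parsed into a list of words with bounding-box coordinates.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical entry of the in-memory page matrix: a word and its box."""
    text: str
    x0: int
    y0: int
    x1: int
    y1: int


def is_at(location: str, key: Word, cand: Word) -> bool:
    """Assumed geometry test: does `cand` lie at `location` relative to `key`?"""
    same_row = cand.y0 < key.y1 and cand.y1 > key.y0  # boxes overlap vertically
    same_col = cand.x0 < key.x1 and cand.x1 > key.x0  # boxes overlap horizontally
    if location == "right":
        return same_row and cand.x0 >= key.x1
    if location == "left":
        return same_row and cand.x1 <= key.x0
    if location == "bottom":
        return same_col and cand.y0 >= key.y1
    if location == "top":
        return same_col and cand.y1 <= key.y0
    raise ValueError(f"unsupported location: {location}")


def extract(words, key_regex, value_regex, location):
    """Return every word matching value_regex at `location` of a key match."""
    key_re, value_re = re.compile(key_regex), re.compile(value_regex)
    matches = []
    for key in words:
        if not key_re.fullmatch(key.text):
            continue
        for cand in words:
            if (cand is not key and is_at(location, key, cand)
                    and value_re.fullmatch(cand.text)):
                matches.append(cand.text)
    return matches


page = [
    Word("State:", 10, 10, 60, 25),
    Word("CALIFORNIA", 70, 10, 170, 25),
    Word("City:", 10, 40, 50, 55),
    Word("LA", 70, 40, 90, 55),
]
print(extract(page, r"City:?", r"[A-Z]+", "right"))  # prints ['LA']
```

The page matrix is built once and passed to `extract` for every field, which mirrors the performance intent described in the second step.
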

Inputs

  1. Document pages and corresponding HOCR
  2. Document level fields
  3. Plug-in configuration
    • Key (regular expression)
    • Value (regular expression)
    • Location (left, right, top, bottom, top left, top right, bottom left, bottom right)

Outputs

  1. Document level fields, values and alternative values updated in batch.xml.

KV Extraction

An admin user can define KV pair patterns using rectangular coordinates from the Admin UI. ‘Add’ and ‘Edit’ buttons are provided to define and modify the KV patterns.

When the user clicks the Add button, another screen opens with the following options, displayed as labelled input fields:

  • Key (regex or other pre-defined field)
  • Value (regex)
  • Fuzzy% (none, 10%, 20%, 30%)
  • Fetch (First, Last or All)
  • Page (First, Last or All)
  • Zone (All, Top, Left, Right, Middle, Bottom)
  • Weight (0 to 1; multiplied with confidence score value to calculate new confidence score)

Fuzzy%: The user can specify one of the following Fuzzy% values while defining a key value pair:

  • None: The value is extracted only if the ‘Key’ matches exactly.
  • 10%: The value is still extracted if up to 10% of the provided key does not match. For example, if the key contains 10 characters and one of them does not match, the word is still accepted as the key.
  • 20%: The value is still extracted if up to 20% of the provided key does not match.
  • 30%: The value is still extracted if up to 30% of the provided key does not match.
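
As a rough illustration of the Fuzzy% tolerance, the sketch below accepts a word as the key when the number of differing characters stays within the allowed percentage of the key length. It makes two assumptions that the text above does not confirm: the key is treated as a literal string rather than a regular expression, and "does not match" is measured with Levenshtein edit distance. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def fuzzy_key_match(key: str, word: str, fuzzy_percent: int) -> bool:
    """Assumed rule: tolerate floor(len(key) * fuzzy_percent / 100) edits."""
    allowed = len(key) * fuzzy_percent // 100
    return edit_distance(key, word) <= allowed


# A 10-character key with one wrong character passes at 10% but not at None (0).
print(fuzzy_key_match("InvoiceNum", "InvoiceNun", 10))
print(fuzzy_key_match("InvoiceNum", "InvoiceNun", 0))
```

With this rounding, short keys get no tolerance at low percentages (a 5-character key at 10% allows zero edits), so a real implementation may round differently.
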

Fetch: The user can specify one of the following fetch values while defining a key value pair:

  • First: extracts only the first data in the value zone that matches the specified value pattern.
  • Last: extracts only the last data in the value zone that matches the specified value pattern.
  • All: extracts all data in the value zone that matches the specified value pattern.
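
The Fetch option can be pictured as a filter over the list of value matches found in the value zone. The sketch below assumes the matches are already in reading order (top to bottom, left to right); `apply_fetch` is a hypothetical helper, not a plug-in API.

```python
def apply_fetch(matches, fetch):
    """Keep the first, the last, or all matches found in the value zone."""
    option = fetch.upper()
    if option == "FIRST":
        return matches[:1]   # empty list stays empty
    if option == "LAST":
        return matches[-1:]
    if option == "ALL":
        return list(matches)
    raise ValueError(f"unknown fetch option: {fetch}")


found = ["1001", "1002", "1003"]
print(apply_fetch(found, "First"))  # prints ['1001']
print(apply_fetch(found, "Last"))   # prints ['1003']
```
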

Page: The user can specify one of the following page values while defining a key value pair:

  • ALL: KV extraction is performed on all pages of the document.
  • FIRST: KV extraction is performed on the first page of the document only.
  • LAST: KV extraction is performed on the last page of the document only.
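
In the same spirit, the Page option decides which pages of a document are searched at all. A minimal sketch, assuming the pages are held in document order and using the hypothetical helper name `select_pages`:

```python
def select_pages(pages, page_option):
    """Return the pages on which KV extraction should run."""
    option = page_option.upper()
    if option == "ALL":
        return list(pages)
    if option == "FIRST":
        return pages[:1]
    if option == "LAST":
        return pages[-1:]
    raise ValueError(f"unknown page option: {page_option}")


document_pages = ["page-1", "page-2", "page-3"]
print(select_pages(document_pages, "LAST"))  # prints ['page-3']
```
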

Zone: The user can specify one of the following zone values while defining a key value pair:

  • All: Values are extracted from the entire image.
  • Top: Values are extracted from the top section of the image.
  • Left: Values are extracted from the left section of the image.
  • Right: Values are extracted from the right section of the image.
  • Bottom: Values are extracted from the bottom section of the image.
  • Middle: Values are extracted from the middle section of the image.
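
The exact boundaries of each zone are not stated here. Purely as an assumption for illustration, the sketch below treats Top, Middle and Bottom as horizontal thirds of the page and Left and Right as vertical halves, and tests the centre point of a word against the chosen zone. `ZONES` and `in_zone` are hypothetical names.

```python
# Assumed zone boundaries as fractions of the page: (x0, y0, x1, y1).
ZONES = {
    "ALL":    (0.0, 0.0, 1.0, 1.0),
    "TOP":    (0.0, 0.0, 1.0, 1 / 3),
    "MIDDLE": (0.0, 1 / 3, 1.0, 2 / 3),
    "BOTTOM": (0.0, 2 / 3, 1.0, 1.0),
    "LEFT":   (0.0, 0.0, 0.5, 1.0),
    "RIGHT":  (0.5, 0.0, 1.0, 1.0),
}


def in_zone(zone, cx, cy, page_width, page_height):
    """True if the point (cx, cy), e.g. a word centre, lies inside the zone."""
    fx0, fy0, fx1, fy1 = ZONES[zone.upper()]
    return (fx0 * page_width <= cx <= fx1 * page_width
            and fy0 * page_height <= cy <= fy1 * page_height)


# A word centred at (100, 50) on a 600 x 900 page.
print(in_zone("Top", 100, 50, 600, 900))
print(in_zone("Bottom", 100, 50, 600, 900))
```
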

The KV extraction plug-in extracts the document level field on the basis of the relative coordinates of the key and the value pattern.

Anchor Key Value

This feature uses the values of previously extracted document level fields to extract other document level fields. A previously defined field can be used as the key when defining the key value pair for another document level field.

  • The KV extraction UI has a “Use Existing Field For Key” checkbox.
  • When it is checked, a list is populated with the names of the document level fields that can be used as a key. The user can select any of these fields as the key.
    • Note: The drop-down shows only those document level fields whose field order number is less than the field order number of the field for which the key value pair is being defined.
  • While defining the key value pair for the document level field, the user needs to capture the key and value rectangles.
  • If the “Use Existing Field For Key” checkbox is selected, the value of the field selected as key should be captured as the key rectangle. This is required to calculate the xOffset and yOffset for the KV field.
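
The field order rule in the note above can be sketched as a simple filter. The mapping of field names to field order numbers and the function name `eligible_key_fields` are hypothetical and serve only to illustrate the rule.

```python
def eligible_key_fields(field_orders, current_field):
    """List the fields whose order number is lower than the current field's.

    field_orders maps a document level field name to its field order number.
    """
    current_order = field_orders[current_field]
    ordered = sorted(field_orders.items(), key=lambda item: item[1])
    return [name for name, order in ordered if order < current_order]


orders = {"State": 1, "City": 2, "Zip": 3}
print(eligible_key_fields(orders, "City"))  # prints ['State']
```

Restricting the list this way ensures that the anchor field has already been extracted by the time the dependent field is processed.
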

Example: Suppose there are two document level fields State and City, and image contains following data:

State: CALIFORNIA

City: LA

While defining the key value field for City:

  • “Use Existing Field For Key” should be checked.
  • State should be selected from the drop-down as the key pattern.
  • CALIFORNIA should be captured as the key.
  • LA should be captured as the value.
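
How xOffset and yOffset are computed is not spelled out here. One plausible reading, shown below strictly as an assumption, is that they are the distance from the top-left corner of the captured key rectangle (CALIFORNIA) to the top-left corner of the captured value rectangle (LA). At extraction time the same offsets are applied to wherever the State value is actually found. `Rect`, `offsets` and `value_zone` are hypothetical names, and the coordinates are made up.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x0: int
    y0: int
    x1: int
    y1: int


def offsets(key_rect: Rect, value_rect: Rect):
    """Assumed definition: value top-left minus key top-left."""
    return value_rect.x0 - key_rect.x0, value_rect.y0 - key_rect.y0


def value_zone(found_key: Rect, x_offset: int, y_offset: int,
               width: int, height: int) -> Rect:
    """Project the value zone relative to where the anchor value was found."""
    x0 = found_key.x0 + x_offset
    y0 = found_key.y0 + y_offset
    return Rect(x0, y0, x0 + width, y0 + height)


# Design time: rectangles captured on the sample image.
california = Rect(70, 10, 170, 25)
la = Rect(70, 40, 90, 55)
x_off, y_off = offsets(california, la)
print(x_off, y_off)  # prints 0 30

# Extraction time: the State value is found elsewhere on another image.
texas = Rect(75, 110, 160, 125)
print(value_zone(texas, x_off, y_off, width=20, height=15))
```
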

Configuration

The following properties are configurable for KV extraction:

| Configurable property | Type of value | Value options | Description |
|---|---|---|---|
| Regex Confidence Score | String | 0 to 100 | Regex confidence score for key value extraction |
| KV Extraction switch | Multi select | ON, OFF | KV extraction switch |

KV Extraction

The following properties are configurable for KV extraction on the UI:

| Configurable property | Type of value | Value options | Description |
|---|---|---|---|
| Use Existing Field For Key | Checkbox | NA | Enables using the value of another field as the key |
| Key | String | NA | Regular expression pattern for the key |
| Value | String | NA | Regular expression pattern for the value |
| Fuzzy% | List of values | None, 10%, 20%, 30% | Drop-down. Default value: None |
| Fetch | List of values | ALL, FIRST, LAST | Drop-down. Default value: ALL |
| Page | List of values | ALL, FIRST, LAST | Drop-down. Default value: ALL |
| Zone | List of values | ALL, TOP, RIGHT, LEFT, MIDDLE, BOTTOM | Drop-down. Default value: ALL |
| Weight | Number | 0 to 1 | Value between 0 and 1 that is multiplied with the confidence score to calculate the new confidence score during extraction |
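
The effect of Weight can be shown with a small calculation. The multiplication itself comes from the description above. The ranking step, in which the best weighted score becomes the field value and the rest become alternate values, is an assumption added for illustration, and `rank_candidates` is a hypothetical name.

```python
def rank_candidates(candidates, weight):
    """Scale each (value, confidence) pair by weight and sort best first."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    scored = [(value, confidence * weight) for value, confidence in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


# Two candidate values with confidence scores on a 0 to 100 scale.
ranked = rank_candidates([("L4", 60.0), ("LA", 80.0)], weight=0.5)
print(ranked)  # prints [('LA', 40.0), ('L4', 30.0)]
```
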

Dependencies

At least one of the following plug-ins must be switched on for KV extraction:

  • RECOSTAR_HOCR
  • TESSERACT_HOCR

These plug-ins generate the HOCR content for an image, which the KV extraction plug-in then uses for extraction.

FAQs

Question: Data is not extracted, or incorrect data is extracted, for a field that uses an existing field as its key.

Answer: Check the value extracted for the field that is used as the key. If that value is incorrect, correct the key value pair defined for that field. This can be tested with the Test KV button on the KV Extraction screen.

 
