Topic/Category: Table Extraction

KBID: KB00007775

Issue:

A customer was having trouble extracting data from a table using the Column Header and Column Coordinates.

Root Cause:

Column Headers are highly dependent on the recognition of the Column Header Pattern in the OCR. Variations in the OCR can cause the table or column not to be extracted properly, or cause complete rows to be skipped.

Example: 

If you are looking for “Part Number” but the OCR value for the column name is “Pert Number”, the table or the column will not extract.

Column Coordinates take the values identified based on a zonal pattern and extract the contents within the zone. If your zone coordinates are not defined properly, you may also get values that belong to the column next to it.
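
The zonal behaviour described above can be illustrated with a small Python sketch. This is not the product's extraction code: the `Word` class, the `words_in_zone` helper, the pixel values, and the rule that a word belongs to a zone when its horizontal center falls inside it are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x0: int  # left edge in pixels
    x1: int  # right edge in pixels


def words_in_zone(words, zone_x0, zone_x1):
    """Return the text of words whose horizontal center falls inside the zone."""
    picked = []
    for w in words:
        center = (w.x0 + w.x1) / 2
        if zone_x0 <= center <= zone_x1:
            picked.append(w.text)
    return picked


# One hypothetical table row: part number, description, quantity.
row = [
    Word("AB-1001", 100, 260),
    Word("Bracket", 320, 450),
    Word("12", 600, 640),
]

good_zone = words_in_zone(row, 90, 280)  # covers only the first column
wide_zone = words_in_zone(row, 90, 400)  # reaches into the second column

print(good_zone)
print(wide_zone)
```

With the wider zone, the description from the neighbouring column is returned together with the part number, which is the symptom described above.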

Solution: 

To resolve the issues regarding the recognition of the Column Headers, you need to account for variances in the OCR.
Here are some ideas:

  1. Try using a different image compression in your import settings. (Group4 vs. LZW)
  2. Try a higher DPI for quality retention of the image during batch processing (300 – 600 DPI)
  3. Since the Column Header Pattern is a Regex, you can try to account for variances in the OCR by replacing a specific Column Header Pattern like Part Number with a more generic Regex like P[A-Za-z0-9\s]{7}ber. This matches any value that starts with a P, continues with exactly seven alphanumeric or whitespace characters, and ends with ber.
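
The effect of a looser Column Header Pattern can be checked outside the product with any regex engine. The following is a minimal Python sketch, not the product's own matching code; the sample OCR strings and the `header_matches` helper are assumptions made for illustration, and the product's regex engine may differ in details. The explicit class `[A-Za-z0-9\s]` is used because a range written as `[A-z]` also matches the punctuation characters that sit between `Z` and `a` in ASCII.

```python
import re

# Strict pattern: only matches when the OCR reads the header perfectly.
STRICT = re.compile(r"Part Number")

# Tolerant pattern: "P", then exactly seven letters, digits, or whitespace
# characters, then "ber".
TOLERANT = re.compile(r"P[A-Za-z0-9\s]{7}ber")


def header_matches(pattern, ocr_text):
    """Return True if the whole OCR header text matches the pattern."""
    return pattern.fullmatch(ocr_text) is not None


# Hypothetical OCR readings of the same column header, plus an unrelated one.
samples = ["Part Number", "Pert Number", "Part Nurnber", "Qty"]

results = {
    s: (header_matches(STRICT, s), header_matches(TOLERANT, s)) for s in samples
}
for text, (strict_ok, tolerant_ok) in results.items():
    print(f"{text!r}: strict={strict_ok} tolerant={tolerant_ok}")
```

Note that the fixed count `{7}` only tolerates substituted characters. A reading such as "Part Nurnber", where the OCR changes the number of characters, still fails; a range such as `{6,8}` would be one way to allow for that, at the cost of a less specific pattern.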

To resolve any issues with the Column Coordinates, adjust your zonal areas so they fit and account for variations in the images. Variations include skewed coordinates, changes in resolution, and changes in overall image size. For best results, standardize your input images and set a minimum quality requirement (e.g. Resolution: 2550×3300, DPI: 300).
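
As a rough illustration of why image size matters for zones, the sketch below rescales a column zone defined on a 2550×3300 reference page (8.5 × 11 inches at 300 DPI) to a page of a different size, and checks an image against the minimum quality requirement. The `scale_zone` and `meets_minimum` helpers and the sample values are hypothetical; the product itself may not rescale zones for you, which is why standardized input is recommended.

```python
REFERENCE_SIZE = (2550, 3300)  # width, height in pixels


def scale_zone(zone, actual_size, reference_size=REFERENCE_SIZE):
    """Scale an (x0, x1) column zone from the reference width to the actual width."""
    x0, x1 = zone
    actual_w, ref_w = actual_size[0], reference_size[0]
    return (round(x0 * actual_w / ref_w), round(x1 * actual_w / ref_w))


def meets_minimum(actual_size, dpi, min_size=REFERENCE_SIZE, min_dpi=300):
    """Check an image against the minimum resolution and DPI requirement."""
    return (
        actual_size[0] >= min_size[0]
        and actual_size[1] >= min_size[1]
        and dpi >= min_dpi
    )


# Zone drawn on the reference page, then applied to a 200 DPI scan (1700x2200).
zone_on_reference = (300, 900)
small_scan = (1700, 2200)

scaled = scale_zone(zone_on_reference, small_scan)
print(scaled)
print(meets_minimum(small_scan, dpi=200))
print(meets_minimum(REFERENCE_SIZE, dpi=300))
```

If the unscaled zone (300, 900) were applied to the smaller scan, it would cover a proportionally wider part of the page and could take in the adjacent column.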

 

