* Page Processing

For search classification, the calculated page confidence values are weighted by the First Page Confidence Score Value, Middle Page Confidence Score Value, and Last Page Confidence Score Value configured in the Search_Classification plugin. This weighting happens before Document Assembly.
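
As a rough illustration of the page-level weighting, here is a minimal Python sketch. It assumes each configured value is a percentage that scales the raw page confidence; the article does not state the exact formula. The weight values and the names `PAGE_WEIGHTS` and `weight_page_confidence` are hypothetical.

```python
# Hypothetical weights, treated as percentages. The real values come from the
# First/Middle/Last Page Confidence Score Value settings of the
# Search_Classification plugin.
PAGE_WEIGHTS = {"first": 100, "middle": 90, "last": 80}


def weight_page_confidence(raw_confidence: float, page_kind: str) -> float:
    """Scale a raw page confidence by the weight for its page kind.

    Assumption: the weight is applied as a simple percentage.
    """
    return raw_confidence * PAGE_WEIGHTS[page_kind] / 100.0


if __name__ == "__main__":
    print(weight_page_confidence(80.0, "middle"))
```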

 

* Regular Document Assembly

1. Separation

For the batch:

  • Look at the highest confidence value for each page
  • If that value belongs to a first page, start a new document
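
The separation rule above can be sketched in Python as follows. This is only an illustration under stated assumptions: each page is represented as a mapping from a candidate label to a confidence, first pages are recognized by a hypothetical `_First_Page` label suffix, and pages that appear before any first page are placed in a document of their own. None of these details are confirmed by this article.

```python
from typing import Dict, List, Tuple

# Assumed shape: one dict per page, mapping a candidate label such as
# "Invoice_First_Page" to its confidence value.
Page = Dict[str, float]


def best_candidate(page: Page) -> Tuple[str, float]:
    """Return the (label, confidence) pair with the highest confidence."""
    return max(page.items(), key=lambda item: item[1])


def separate(pages: List[Page]) -> List[List[int]]:
    """Group page indexes into documents, splitting at each first page."""
    documents: List[List[int]] = []
    for index, page in enumerate(pages):
        label, _ = best_candidate(page)
        # Start a new document on a first page. Also start one if no document
        # exists yet (assumption for batches that do not begin with a first page).
        if label.endswith("_First_Page") or not documents:
            documents.append([])
        documents[-1].append(index)
    return documents


if __name__ == "__main__":
    batch = [
        {"Invoice_First_Page": 90.0, "Invoice_Middle_Page": 40.0},
        {"Invoice_Middle_Page": 70.0},
        {"Invoice_Last_Page": 60.0, "Letter_First_Page": 55.0},
        {"Letter_First_Page": 80.0},
    ]
    print(separate(batch))
```

Note that the third page has a first-page candidate, but it is not the highest-confidence value on that page, so no split happens there.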

 

2. Classification

For each document in the batch:

  • Look at the confidence values of all pages in the document
    – calculate the weighted confidence for each doc type found within the document
    – weighting factors:
      – if first, middle & last pages are found in order for the same doc type, apply the first-middle-last page weight
      – else if first & last pages are found in order for the same doc type, apply the first-last page weight
      – else if first & middle pages are found in order for the same doc type, apply the first-middle page weight
      – else if middle & last pages are found in order for the same doc type, apply the middle-last page weight
      – else if just a first page is found, apply the first-page weight
      – else if just a middle page is found, apply the middle-page weight
      – else if just a last page is found, apply the last-page weight

     

** What weighting factors apply?
The following weights are not used for separation. They are applied only to generate the document classification confidence score:

DA Rule First-middle-last Page:        100
DA Rule First Page:        50
DA Rule Middle Page:        25
DA Rule Last Page:        50
DA Rule First-last Page:        75
DA Rule First-middle Page:        50
DA Rule Middle-last Page:        50

– the doc type with the highest weighted value is used as the doc type for the document
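
The rule selection and weighting can be sketched in Python as below. The rule weights and the order in which rules are checked come from this article. Everything else is an assumption made for illustration: the `(doc type, page kind, confidence)` candidate shape, the names `DA_RULES`, `occurs_in_order` and `classify`, and in particular the scoring formula (mean page confidence for the doc type multiplied by the rule weight as a percentage), which the article does not specify.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Rule weights from the article, checked in the order the article lists them.
DA_RULES: List[Tuple[Tuple[str, ...], int]] = [
    (("first", "middle", "last"), 100),
    (("first", "last"), 75),
    (("first", "middle"), 50),
    (("middle", "last"), 50),
    (("first",), 50),
    (("middle",), 25),
    (("last",), 50),
]

# Assumed candidate shape: (doc type, page kind, confidence).
Candidate = Tuple[str, str, float]


def occurs_in_order(pattern: Sequence[str], kinds: Sequence[str]) -> bool:
    """True if the page kinds in `pattern` appear in `kinds` in that order."""
    remaining = iter(kinds)
    return all(kind in remaining for kind in pattern)


def classify(pages: List[List[Candidate]]) -> Tuple[str, float]:
    """Return the doc type with the highest weighted confidence for one document."""
    kinds_by_type: Dict[str, List[str]] = defaultdict(list)
    confidences_by_type: Dict[str, List[float]] = defaultdict(list)
    for page in pages:
        for doc_type, kind, confidence in page:
            kinds_by_type[doc_type].append(kind)
            confidences_by_type[doc_type].append(confidence)

    scores: Dict[str, float] = {}
    for doc_type, kinds in kinds_by_type.items():
        # First matching rule wins, mirroring the if / else-if chain.
        weight = next((w for pattern, w in DA_RULES if occurs_in_order(pattern, kinds)), 0)
        confidences = confidences_by_type[doc_type]
        # Assumption: combine the mean page confidence with the rule weight.
        scores[doc_type] = sum(confidences) / len(confidences) * weight / 100.0

    best = max(scores, key=lambda doc_type: scores[doc_type])
    return best, scores[best]


if __name__ == "__main__":
    document = [
        [("Invoice", "first", 90.0), ("Letter", "first", 60.0)],
        [("Invoice", "middle", 80.0)],
        [("Invoice", "last", 70.0), ("Letter", "middle", 50.0)],
    ]
    print(classify(document))
```

In this example, Invoice matches the first-middle-last rule (weight 100) while Letter only matches the first-middle rule (weight 50), so Invoice is chosen.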

 

* Advanced Document Assembly

1. Separation

For the batch:

  • A proprietary algorithm applies forward and reverse page-level look-aheads and look-behinds across all alternate values. Decisions are based on every permutation of pages and the alternate value information in the XML.

The algorithm relies on these weightings from DOCUMENT_ASSEMBLER:

DA First Page Confidence Threshold:        50
DA Middle Page Confidence Threshold:        15
DA Last Page Confidence Threshold:        10

2. Classification

For each document in the batch:

  • Look at the confidence values of all pages in the document
    – calculate the weighted confidence for each doc type found within the document
    – weighting factors:
      – if first, middle & last pages are found in order for the same doc type, apply the first-middle-last page weight
      – else if first & last pages are found in order for the same doc type, apply the first-last page weight
      – else if first & middle pages are found in order for the same doc type, apply the first-middle page weight
      – else if middle & last pages are found in order for the same doc type, apply the middle-last page weight
      – else if just a first page is found, apply the first-page weight
      – else if just a middle page is found, apply the middle-page weight
      – else if just a last page is found, apply the last-page weight

     

** What weighting factors apply?
The following weights are not used for separation. They are applied only to generate the document classification confidence score:

DA Rule First-middle-last Page:        100
DA Rule First Page:        50
DA Rule Middle Page:        25
DA Rule Last Page:        50
DA Rule First-last Page:        75
DA Rule First-middle Page:        50
DA Rule Middle-last Page:        50

– the doc type with the highest weighted value is used as the doc type for the document

 


Walter Lee