A well-formed set of HOCR XML files, placed in a hierarchical structure (Batch Class > Document Type > Page Type), is used to register a few standard HOCR XML documents with the Lucene search engine. This process is called learning because it feeds the XML files into Lucene’s memory by creating Lucene indexes. HOCR files in a batch instance are compared against these stored indexes to find the best match and classify the pages. Learning is a one-time process. It makes classification fast because no index needs to be generated at runtime to classify the documents.
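The idea of learning once and matching at runtime can be illustrated with a toy example. The sketch below does not use Lucene: it assumes a simple bag-of-words index with cosine similarity as a stand-in for Lucene's indexing and scoring. The function names (`learn`, `classify`) and the sample text are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def learn(samples):
    """Build the index once: map each (document type, page type) label to term counts.

    `samples` maps a label to the plain text extracted from its sample pages.
    """
    return {label: Counter(tokenize(text)) for label, text in samples.items()}


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(index, page_text):
    """Return the learned label whose term vector best matches the page."""
    page = Counter(tokenize(page_text))
    return max(index, key=lambda label: cosine(index[label], page))


# Learning happens once, ahead of time (hypothetical sample text).
index = learn({
    ("Application-Checklist", "First_Page"): "application checklist applicant name date",
    ("Application-Checklist", "Last_Page"): "signature of applicant terms accepted",
})

# At runtime, only the incoming page is tokenized; the index is reused as-is.
best = classify(index, "Application checklist for applicant name")
print(best)
```

The point of the design is that `learn` runs once, while `classify` only reads the stored index, which mirrors why no index needs to be built during classification.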

Steps of learning

1. First, create the document type that Ephesoft has to recognize. Suppose the user has created the Application-Checklist document type in batch class BC1 and saved it. Saving creates the folders where the TIFF files to be learned can be placed. In this case the folders are created under “Ephesoft-install-dir\SharedFolders\Batch-Class-Id\lucene-search-clasification-sample”. The following three subfolders are created:

  1. Application-Checklist_First_Page
  2. Application-Checklist_Last_Page
  3. Application-Checklist_Middle_Page

2. Now select a document type and upload documents with the extension .tif, .tiff, or .pdf. Uploaded documents are learned automatically. They can be single-page or multipage.

3. If the uploaded document is a single-page document, its single page is copied to Application-Checklist_First_Page.

4. If the uploaded document is a multipage document, its first and last pages are copied to Application-Checklist_First_Page and Application-Checklist_Last_Page respectively, and all other pages are copied to Application-Checklist_Middle_Page.
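The page placement rule in steps 3 and 4 can be sketched as a small function. This is only an illustration of the rule as described, not Ephesoft's implementation. The function name `assign_page_folders` is hypothetical, and treating a two-page document as one first page and one last page is an assumption that follows from the wording of step 4.

```python
def assign_page_folders(doc_type, page_count):
    """Return the target subfolder name for each page, in page order."""
    if page_count < 1:
        raise ValueError("a document must have at least one page")

    first = f"{doc_type}_First_Page"
    middle = f"{doc_type}_Middle_Page"
    last = f"{doc_type}_Last_Page"

    # A single-page document goes to the First_Page folder only.
    if page_count == 1:
        return [first]

    # Multipage: first page, then any middle pages, then the last page.
    return [first] + [middle] * (page_count - 2) + [last]


folders = assign_page_folders("Application-Checklist", 4)
for page_number, folder in enumerate(folders, start=1):
    print(page_number, folder)
```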

Learn File(s)

This feature learns the documents present in the document type's folders.

On learning, the following actions take place:

1. HOCR files are created in the folder “Ephesoft-install-dir\SharedFolders\Batch-Class-Id\lucene-search-clasification-sample” for Lucene learning.

2. Thumbnails are created in the folder “Ephesoft-install-dir\SharedFolders\Batch-Class-Id\image-classification-sample” for image classification.

3. Indexes are created in the folder “Ephesoft-install-dir\SharedFolders\Batch-Class-Id\learn-index” for index learning.
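To check where learning output should appear for a given batch class, the three folder paths can be built from the install directory and the batch class ID. The sketch below is a convenience helper, not part of Ephesoft. The function name `learning_folders`, the install directory `C:\Ephesoft`, and the use of `BC1` as the batch class ID are assumptions for illustration; the folder names are copied as they are spelled in this section.

```python
from pathlib import PureWindowsPath


def learning_folders(install_dir, batch_class_id):
    """Return the folders that learning writes to for one batch class."""
    base = PureWindowsPath(install_dir) / "SharedFolders" / batch_class_id
    return {
        # Folder name spelled exactly as in the text above.
        "hocr": base / "lucene-search-clasification-sample",
        "thumbnails": base / "image-classification-sample",
        "indexes": base / "learn-index",
    }


# Hypothetical install directory and batch class ID.
folders = learning_folders(r"C:\Ephesoft", "BC1")
for purpose, path in folders.items():
    print(f"{purpose}: {path}")
```

`PureWindowsPath` is used so that the paths are formatted with backslashes regardless of the operating system the script runs on.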

View Learn File(s)

The user can view the learned files of a document type in the UI and can navigate with the keyboard to see the results for different document types.

Select any document type and click the ‘View Learn File(s)’ button. The following UI is presented:

The result page has the following columns:

| Column Name | Type of value | Value options | Description |
|---|---|---|---|
| File Name | String | NA | The name of the uploaded file. |
| Page Type | String | FIRST, MIDDLE, LAST | Whether the file was learned as a first, middle, or last page. |
| Image Classification | Boolean | True, False | Whether a thumbnail was created. True if it was created, otherwise false. |
| Lucene Learning | Boolean | True, False | Whether an HOCR file was created. True if it was created, otherwise false. |

Ephesoft has now learned the Application-Checklist document type and is ready to use it.


