Overview

This plugin classifies Ephesoft documents using Lucene-based indexing for a batch class.

The plugin classifies documents in two stages:

  • Learning: Indexes are generated for the documents. The generated indexes are then used to classify documents. For further information on learning, please refer to the document “Learning document”.
  • Classification: When a document is classified with the Search Classification plugin, the learnt data is used as the reference data. To determine a page type, the plugin takes the HOCR content extracted from the image and compares it against the data learnt in the previous stage.

Before this plugin is used, the HOCR content must be generated by an HOCR generation plugin such as “RecoStar HOCR” or “Tesseract HOCR”.
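
The two stages can be illustrated with a minimal sketch. It is a toy term-overlap model, not the plugin's Lucene implementation, and the function names, page type names and sample texts are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def learn(samples):
    """Learning stage: build a term-frequency index for each page type.

    `samples` maps a page type name to a list of sample page texts.
    """
    index = {}
    for page_type, texts in samples.items():
        counts = Counter()
        for text in texts:
            counts.update(tokenize(text))
        index[page_type] = counts
    return index


def classify(index, page_text):
    """Classification stage: rank the learnt page types for one page.

    The score is the fraction of the page's tokens that also occur
    in the learnt index of that page type.
    """
    tokens = tokenize(page_text)
    if not tokens:
        return []
    scores = []
    for page_type, counts in index.items():
        hits = sum(1 for token in tokens if token in counts)
        scores.append((page_type, hits / len(tokens)))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical learnt samples and a page to classify.
samples = {
    "Invoice_First_Page": ["Invoice number total amount due"],
    "Application_First_Page": ["Application form applicant name address"],
}
index = learn(samples)
ranked = classify(index, "Total amount due on this invoice")
```

Here `ranked` lists the invoice page type first, because four of the six tokens on the page occur in its learnt index and none occur in the other one.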

Configuration

Configurable Properties

The following configurable properties are available for the plugin:

| Configurable property | Type of value | Value options | Description |
|---|---|---|---|
| Lucene Valid Extensions | List of values | xml, html | The valid extensions of the input files used for classifying the document type. |
| Lucene Min Term Frequency | Integer | NA | The frequency below which terms in the source document are ignored. |
| Lucene Min Document Frequency | Integer | NA | Words that do not occur in at least this many documents are ignored. |
| Lucene Min Word Length | Integer | NA | The minimum word length; shorter words in the HOCR content are ignored. |
| Lucene Min Query Terms | Integer | NA | The minimum number of query terms included in any generated query. |
| Lucene Top Level Field | String | NA | The default field for query terms. |
| Lucene No Of Pages | Integer | NA | The number of documents to be returned by a query search. |
| Lucene Index Fields | List of values | title, summary | The index fields used when searching for the document type with Lucene. |
| Lucene Stop Words | List of values | title, name | The words to be ignored during document classification. |
| Search Classification Switch | List of values | ON, OFF | Switches the Search Classification plugin ON or OFF. |
| Search Classification Max Results | Integer | NA | The maximum number of results generated from a query. |
| First Page Confidence Score Value | Integer | NA | Used to update the confidence score on the basis of the first page type. |
| Middle Page Confidence Score Value | Integer | NA | Used to update the confidence score on the basis of the middle page type. |
| Last Page Confidence Score Value | Integer | NA | Used to update the confidence score on the basis of the last page type. |
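
As a rough illustration of how the frequency, word length and stop word properties could interact, the sketch below filters the terms of a page before a query is built. The thresholds resemble those of Lucene's MoreLikeThis query, but the source does not confirm which Lucene class the plugin uses. The configuration keys, values and the helper function are assumptions.

```python
from collections import Counter

# Hypothetical values; the keys mirror the property names in the table above.
config = {
    "min_term_frequency": 2,      # Lucene Min Term Frequency
    "min_document_frequency": 1,  # Lucene Min Document Frequency
    "min_word_length": 3,         # Lucene Min Word Length
    "stop_words": {"title", "name"},  # Lucene Stop Words
}


def select_query_terms(page_tokens, document_frequency, config):
    """Return the sorted terms of a page that pass every configured threshold.

    `document_frequency` maps a term to the number of learnt documents
    that contain it.
    """
    term_frequency = Counter(page_tokens)
    selected = []
    for term, frequency in term_frequency.items():
        if term in config["stop_words"]:
            continue  # ignored as a stop word
        if len(term) < config["min_word_length"]:
            continue  # word is too short
        if frequency < config["min_term_frequency"]:
            continue  # too rare in the source page
        if document_frequency.get(term, 0) < config["min_document_frequency"]:
            continue  # occurs in too few learnt documents
        selected.append(term)
    return sorted(selected)


page_tokens = ["invoice", "invoice", "no", "no", "title", "title",
               "total", "amount", "amount"]
document_frequency = {"invoice": 3, "no": 5, "title": 4, "total": 2, "amount": 0}
terms = select_query_terms(page_tokens, document_frequency, config)
```

With these sample values only `invoice` survives: `no` is too short, `title` is a stop word, `total` occurs only once on the page, and `amount` does not occur in any learnt document.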

The configuration screen for these properties is shown in the screenshot below:

Steps of execution

    • This plug-in runs in the page process phase of the application, after all import processing on the batch is complete and the batch is ready for page processing.
    • Learning must be done on the batch class before this plugin is used.
    • The plug-in classifies the input images using Lucene-based indexing.
    • When the work is done, it writes the classified document type information into the batch.xml file.
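
The steps above can be sketched as a small page-processing loop. This is a simplified illustration, not the plugin's code: the XML element names, the stub classifier and the page type names are assumptions, and the real batch.xml schema is richer than the fragment used here.

```python
import xml.etree.ElementTree as ET


def run_search_classification(batch_xml, classify_page, switch="ON"):
    """Classify every page of a batch and return the updated XML as a string.

    `classify_page` takes an HOCR file name and returns (page_type, confidence).
    When the switch is not "ON", the batch is returned without changes.
    """
    root = ET.fromstring(batch_xml)
    if switch == "ON":
        for page in list(root.iter("Page")):
            hocr_file = page.findtext("HocrFileName")
            page_type, confidence = classify_page(hocr_file)
            # Hypothetical element names for the classification result.
            field = ET.SubElement(page, "PageLevelField")
            ET.SubElement(field, "Value").text = page_type
            ET.SubElement(field, "Confidence").text = str(confidence)
    return ET.tostring(root, encoding="unicode")


def stub_classifier(hocr_file):
    """Stand-in for the Lucene search; returns a fixed page type and score."""
    if hocr_file == "page0.html":
        return ("Invoice_First_Page", 87)
    return ("Invoice_Last_Page", 64)


# Simplified, hypothetical batch.xml content with two pages.
batch_xml = (
    "<Batch><Pages>"
    "<Page><HocrFileName>page0.html</HocrFileName></Page>"
    "<Page><HocrFileName>page1.html</HocrFileName></Page>"
    "</Pages></Batch>"
)
updated = run_search_classification(batch_xml, stub_classifier)
```

Passing `switch="OFF"` leaves the batch untouched, which mirrors the Search Classification Switch property.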

Troubleshooting

The following are a few common error messages caused by malfunctioning of the plugin:

| S no. | Error message | Possible root cause |
|---|---|---|
| 1 | No index files exist inside folder | Learning has not been done for the batch class. |
| 2 | Page Types not configured in Database. | Invalid indexes are present in the index data for the batch class. |
| 3 | CorruptIndexException while reading Index. | The index data in the index folder for the batch class is corrupted. |
| 4 | IOException while reading Index | The index data cannot be opened because an index file is corrupted or locked. |
| 5 | No valid extensions are specified in resources | The page contains an invalid HOCR file for processing. |
| 6 | No pages found in batch XML. | The Pages tag was not found in the input batch.xml. |
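
Some of these conditions can be checked before a batch is run. The sketch below covers rows 1, 5 and 6 only; it is an assumption about how such a pre-flight check might look, not part of the plugin. The folder layout, file names and the `Page` element name are hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def preflight_checks(index_dir, hocr_files, valid_extensions, batch_xml_path):
    """Return the messages from the table above whose conditions are detected."""
    problems = []
    # Row 1: learning has not produced any index files.
    if not os.path.isdir(index_dir) or not os.listdir(index_dir):
        problems.append("No index files exist inside folder")
    # Row 5: an HOCR file does not have one of the valid extensions.
    allowed = {extension.lower() for extension in valid_extensions}
    for name in hocr_files:
        extension = os.path.splitext(name)[1].lstrip(".").lower()
        if extension not in allowed:
            problems.append("No valid extensions are specified in resources")
            break
    # Row 6: the batch XML holds no page entries.
    root = ET.parse(batch_xml_path).getroot()
    if root.find(".//Page") is None:
        problems.append("No pages found in batch XML.")
    return problems


# Build a throwaway batch folder that triggers all three checks.
with tempfile.TemporaryDirectory() as work_dir:
    index_dir = os.path.join(work_dir, "lucene-index")  # hypothetical folder name
    os.mkdir(index_dir)
    batch_xml_path = os.path.join(work_dir, "batch.xml")
    with open(batch_xml_path, "w", encoding="utf-8") as handle:
        handle.write("<Batch><Pages/></Batch>")
    problems = preflight_checks(
        index_dir, ["page0.txt"], ["xml", "html"], batch_xml_path
    )
```

The remaining rows depend on the state of the Lucene index and the database, so they are not covered by this sketch.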

 
