Search Classification Plugin

Overview

This plugin classifies Ephesoft documents using Lucene-based indexing for a batch class.

The plugin classifies documents in two stages:

  • Learning: The learning process generates indexes from sample documents. The generated indexes are later used to classify documents. For more information on learning, refer to the document “Learning document”.
  • Classification: When a document is classified with the Search Classification plugin, the learnt data serves as reference data. To determine a page type, the plugin takes the HOCR content extracted from the image and compares it against the data learnt in the previous stage.

To use this plugin, HOCR content must first be generated by an HOCR generation plugin such as “Recostar HOCR” or “Tesseract HOCR”.
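
The two stages can be illustrated with a small, self-contained sketch. It is not the plugin's implementation: it replaces the Lucene index with plain term counts, and the names `learn`, `classify`, and the page type labels are hypothetical, chosen only to show how learnt data acts as the reference for classification.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase word tokens; a stand-in for text extracted from HOCR."""
    return re.findall(r"[a-z0-9]+", text.lower())


def learn(samples):
    """Learning stage: build one term-count profile per page type."""
    index = {}
    for page_type, text in samples:
        index.setdefault(page_type, Counter()).update(tokenize(text))
    return index


def classify(index, page_text):
    """Classification stage: rank learnt page types by term overlap."""
    terms = Counter(tokenize(page_text))
    scores = {}
    for page_type, profile in index.items():
        # Count shared terms; a Counter returns 0 for terms it has not seen.
        scores[page_type] = sum(
            min(count, profile[term]) for term, count in terms.items()
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical learning samples: (page type, sample text).
samples = [
    ("Invoice_First_Page", "invoice number total amount due"),
    ("Application_First_Page", "applicant name date of birth signature"),
]
index = learn(samples)
ranking = classify(index, "Total amount due on this invoice")
print(ranking[0][0])  # best-matching page type
```

A real Lucene index scores matches with a relevance model rather than raw overlap, so the numbers here are only illustrative.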

Configuration

Configurable Properties

The plugin has the following configurable properties:

Configurable property | Type of value | Value options | Description
--- | --- | --- | ---
Lucene Valid Extensions | String | Ex: html, xml | The valid extensions of the input files used for classifying the document type. Default value is html, xml.
Lucene Min Term Frequency | Integer | NA | Terms that occur less often than this frequency in the source document are ignored.
Lucene Min Document Frequency | Integer | NA | Words that do not occur in at least this many documents are ignored.
Lucene Min Word Length | Integer | NA | Words in the HOCR content shorter than this length are ignored.
Lucene Min Query Terms | Integer | NA | The minimum number of query terms that will be included in any generated query.
Lucene Top Level Field | String | NA | The default field for query terms.
Lucene No Of Pages | Integer | NA | The number of documents to be returned by a query search.
Lucene Index Fields | String | Ex: summary | The index field used when searching for the document type with Lucene.
Lucene Stop Words | String | Ex: name; title | Words to be ignored while classifying a document.
Search Classification Switch | String | ON, OFF | Turns the Search Classification plugin ON or OFF. Default is ON.
Search Classification Max Results | Integer | NA | The maximum number of results generated from a query.
First Page Confidence Score Value | Integer | NA | Used for updating the confidence score on the basis of the first page type.
Middle Page Confidence Score Value | Integer | NA | Used for updating the confidence score on the basis of the middle page type.
Last Page Confidence Score Value | Integer | NA | Used for updating the confidence score on the basis of the last page type.

 

The properties are shown in the screenshot below:

[Screenshot: 3.1_SearchClassificationPlugin_10001]
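
As a rough illustration of how the term-filtering properties interact, the sketch below applies minimum term frequency, minimum document frequency, minimum word length, and stop words to the words of one page. The descriptions of these properties resemble the options of Lucene's MoreLikeThis query builder, but whether the plugin uses that class is an assumption; the function `select_query_terms`, its default values, and the sample data are hypothetical.

```python
from collections import Counter


def parse_stop_words(value):
    """Split a 'name; title' style property value into a list of words."""
    return [word.strip() for word in value.split(";") if word.strip()]


def select_query_terms(page_words, doc_frequency, min_term_freq=2,
                       min_doc_freq=1, min_word_length=3, stop_words=()):
    """Return the words that survive all four filters, sorted alphabetically."""
    stop = {word.lower() for word in stop_words}
    counts = Counter(word.lower() for word in page_words)
    kept = []
    for word, freq in counts.items():
        if len(word) < min_word_length:
            continue  # shorter than Lucene Min Word Length
        if word in stop:
            continue  # listed in Lucene Stop Words
        if freq < min_term_freq:
            continue  # too rare in this page (Lucene Min Term Frequency)
        if doc_frequency.get(word, 0) < min_doc_freq:
            continue  # in too few learnt documents (Lucene Min Document Frequency)
        kept.append(word)
    return sorted(kept)


# Hypothetical page content and per-word document frequencies.
page_words = "invoice invoice total total of of name name rare rare".split()
doc_frequency = {"invoice": 5, "total": 4, "of": 9, "name": 7, "rare": 0}
terms = select_query_terms(
    page_words,
    doc_frequency,
    stop_words=parse_stop_words("name; title"),
)
print(terms)
```

Raising any of the three minimum values narrows the set of query terms, which tends to make classification stricter at the cost of ignoring more of the page text.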

 

Steps of execution

  • This plugin runs in the page process phase of the application, after all import processing on the batch is complete and the batch is ready for page processing.
  • Learning must be done on the batch class before this plugin is used.
  • The plugin classifies the input images using Lucene-based indexing.
  • When classification is complete, it writes the information for the classified document type into the batch.xml file.
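
The steps above can be sketched as a simple loop. This is only an outline under stated assumptions: a list of dictionaries stands in for the pages in batch.xml, `classify_page` is a hypothetical callback standing in for the Lucene query, and the key names (`hocr_text`, `alternatives`, `page_type`) are invented for the example rather than taken from the real batch.xml schema.

```python
def run_search_classification(switch, pages, classify_page, max_results=5):
    """Attach ranked page-type alternatives to each page when the switch is ON."""
    if switch.strip().upper() != "ON":
        return pages  # plugin switched OFF: the batch passes through untouched
    for page in pages:
        ranked = classify_page(page["hocr_text"])
        # Keep at most max_results alternatives (Search Classification Max Results).
        page["alternatives"] = ranked[:max_results]
        page["page_type"] = ranked[0][0] if ranked else None
    return pages


def stub_classifier(text):
    """Hypothetical stand-in for the index lookup: (page type, confidence) pairs."""
    if "invoice" in text.lower():
        return [
            ("Invoice_First_Page", 90),
            ("Invoice_Middle_Page", 40),
            ("Invoice_Last_Page", 10),
        ]
    return []


batch = [
    {"id": "PG0", "hocr_text": "Invoice number 42"},
    {"id": "PG1", "hocr_text": "blank"},
]
result = run_search_classification("ON", batch, stub_classifier, max_results=2)
skipped = run_search_classification(
    "OFF", [{"id": "PG0", "hocr_text": "Invoice"}], stub_classifier
)
```

In this sketch a page with no match keeps an empty list of alternatives, so a later step can still tell that classification ran but found nothing.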

 

Dependency

This plugin depends on an HOCR generation plugin such as Recostar or Tesseract. It takes the HOCR files generated by Recostar or Tesseract as its input.

 

Troubleshooting

The following are a few common error messages caused by malfunctioning of the plugin:

S. No. | Error message | Possible root cause
--- | --- | ---
1 | No index files exist inside folder | Learning has not been done for the batch class.
2 | Page Types not configured in Database. | The index data for the batch class contains invalid indexes.
3 | CorruptIndexException while reading Index. | The index data in the index folder for the batch class is corrupted.
4 | IOException while reading Index | The index data cannot be opened because an index file is corrupted or locked.
5 | No valid extensions are specified in resources | The page contains an invalid HOCR file for processing.
6 | No pages found in batch XML. | The Pages tag was not found in the input batch.xml.
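
Some of these conditions can be checked before a batch is run. The sketch below is a hypothetical pre-flight helper, not part of the plugin: the function `preflight`, its parameters, and the folder layout are assumptions, and it only reuses the message texts from the table above for three of the six cases.

```python
from pathlib import Path
import tempfile


def preflight(index_dir, valid_extensions, page_files):
    """Return the error messages from the table that the given inputs would trigger."""
    problems = []
    index_path = Path(index_dir)
    # An absent or empty index folder suggests that learning was never run.
    if not index_path.is_dir() or not any(index_path.iterdir()):
        problems.append("No index files exist inside folder")
    # Parse a 'html, xml' style property value.
    allowed = {ext.strip().lower() for ext in valid_extensions.split(",") if ext.strip()}
    if not allowed:
        problems.append("No valid extensions are specified in resources")
    if not page_files:
        problems.append("No pages found in batch XML.")
    return problems


with tempfile.TemporaryDirectory() as folder:
    # Empty folder: learning has not produced any index files yet.
    before = preflight(folder, "html, xml", ["PG0.html"])
    # Hypothetical index file name, created only so the folder is not empty.
    (Path(folder) / "segments_1").write_text("placeholder")
    after = preflight(folder, "html, xml", ["PG0.html"])
    empty_config = preflight(folder, " ", [])

print(before, after, empty_config)
```

Index corruption and file locks (rows 3 and 4) are raised by the index reader itself and are not covered by this kind of check.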

 

 


wikiadmin
