Search Classification Plugin
This plugin classifies Ephesoft documents using Lucene-based indexing for a batch class.
The plugin classifies documents in two stages:
- Learning: The learning process generates indexes from documents. The generated indexes are then used to classify documents. For more information on learning, refer to the document “Learning document”.
- Classification: When classifying a document, the plugin uses the learnt data as its reference. It takes the HOCR content extracted from the image and compares it against the data learnt in the previous stage.
To use this plugin, the HOCR content must first be generated by an HOCR generation plugin such as “Recostar HOCR” or “Tesseract HOCR”.
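The two stages can be illustrated with a minimal sketch. This is not the plugin's implementation and does not use Lucene: it assumes a simple word-overlap score, and the function names (`learn`, `classify`) and page type names are hypothetical, chosen only to show how learnt data serves as the reference during classification.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def learn(samples):
    """Learning stage: build one term-count 'index' per page type.

    `samples` maps a page type name to a list of sample texts
    (stand-ins for the HOCR content of learnt pages).
    """
    index = {}
    for page_type, texts in samples.items():
        counts = Counter()
        for text in texts:
            counts.update(tokenize(text))
        index[page_type] = counts
    return index


def classify(index, page_text):
    """Classification stage: score a page against every learnt page type.

    The score is the fraction of the page's distinct words that were seen
    during learning for that type. Returns (page_type, score) pairs, best first.
    """
    words = set(tokenize(page_text))
    scores = []
    for page_type, counts in index.items():
        matched = sum(1 for w in words if w in counts)
        score = matched / len(words) if words else 0.0
        scores.append((page_type, score))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical learnt samples; real input would be HOCR text from the OCR plugin.
learnt = learn({
    "Invoice_First_Page": ["invoice number total amount due"],
    "Application_First_Page": ["application form applicant name address"],
})
ranking = classify(learnt, "Invoice total amount payable")
print(ranking[0][0])
```

In the actual plugin the index is a Lucene index on disk and the scoring is done by Lucene queries. The sketch only mirrors the order of operations: learn first, then classify against what was learnt.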
The plugin has the following configurable properties:
|Configurable property||Type of value||Value options||Description|
|Lucene Valid Extensions||String||Ex: html, xml||The valid extensions of the input files used for classifying the document type. Default value is html, xml.|
|Lucene Min Term Frequency||Integer||NA||The frequency below which terms will be ignored in the source document.|
|Lucene Min Document Frequency||Integer||NA||The document frequency below which words will be ignored, that is, words that do not occur in at least this many documents.|
|Lucene Min Word Length||Integer||NA||The minimum word length below which words will be ignored from the HOCR content.|
|Lucene Min Query Terms||Integer||NA||The minimum number of query terms that will be included in any generated query.|
|Lucene Top Level Field||String||NA||The default field for query terms.|
|Lucene No Of Pages||Integer||NA||The number of documents to be returned by a query search.|
|Lucene Index Fields||String||Ex: summary||The index field used for searching the document type with Lucene.|
|Lucene Stop Words||String||Ex: name; title||The words to be ignored during document classification.|
|Search Classification Switch||String||ON, OFF||Switches the search classification plugin ON or OFF. Default is ON.|
|Search Classification Max Results||Integer||NA||The maximum number of results generated from a query.|
|First Page Confidence Score Value||Integer||NA||This property is used for updating the confidence score on the basis of the first page type.|
|Middle Page Confidence Score Value||Integer||NA||This property is used for updating the confidence score on the basis of the middle page type.|
|Last Page Confidence Score Value||Integer||NA||This property is used for updating the confidence score on the basis of the last page type.|
The configurable properties are shown in the screenshot below:
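To make the filtering properties in the table more concrete, the sketch below shows how Lucene Min Word Length, Lucene Min Term Frequency and Lucene Stop Words could combine to decide which words of a page take part in classification. It is an illustration only, not the plugin's code: the function name `select_query_terms` is hypothetical, and the semicolon separator for stop words is an assumption taken from the example value “name; title”.

```python
import re
from collections import Counter


def select_query_terms(text, min_word_length, min_term_frequency, stop_words):
    """Return the terms of `text` that survive the three filters, with counts."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Assumption: stop words are separated by semicolons, as in "name; title".
    stop = {w.strip().lower() for w in stop_words.split(";") if w.strip()}
    counts = Counter(
        w for w in words
        # Words shorter than the minimum length, and stop words, are ignored.
        if len(w) >= min_word_length and w not in stop
    )
    # Terms that occur fewer times than the minimum frequency are ignored.
    return {term: n for term, n in counts.items() if n >= min_term_frequency}


page_text = "Name: John. Title: Invoice. Invoice no 42, invoice total 100, name on file."
terms = select_query_terms(
    page_text,
    min_word_length=3,
    min_term_frequency=2,
    stop_words="name; title",
)
print(terms)
```

With these example values, short words such as “no” and “on” are dropped by the length filter, “name” and “title” by the stop word list, and words that occur only once by the frequency filter.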
Steps of execution
- This plugin runs in the page process phase of the application, after all import processing on the batch has been done and the batch is ready for page processing.
- Learning must be done on the batch class before using this plugin.
- The plugin classifies the input images using Lucene-based indexing.
- After all the work is done, it writes the information for the classified document type into the batch.xml file.
This plugin depends on an HOCR generation plugin such as Recostar or Tesseract. It takes the HOCR file generated by that plugin as its input.
The following are a few common error messages caused by the plugin malfunctioning:
|S. No.||Error message||Possible root cause|
|1||No index files exist inside folder||Learning has not been done for the batch class.|
|2||Page Types not configured in Database.||Invalid indexes are present in the index data for the batch class.|
|3||CorruptIndexException while reading Index.||The index data in the index folder for the batch class is corrupted.|
|4||IOException while reading Index||The index data cannot be opened because an index file is corrupted or has a lock on it.|
|5||No valid extensions are specified in resources||The page contains an invalid HOCR file for processing.|
|6||No pages found in batch XML.||The Pages tag was not found in the input batch.xml.|
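Two of the root causes in the table (rows 1 and 6) can be checked before running the plugin. The sketch below is a hypothetical pre-flight helper, not part of Ephesoft: the function name `diagnose`, the index folder path passed in, and the assumption that the batch XML uses an un-namespaced `Pages` element are all illustrative.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def diagnose(index_dir, batch_xml_text):
    """Return the error messages from the table that apply to these inputs."""
    problems = []
    # Row 1: learning has not produced any index files in the folder.
    if not os.path.isdir(index_dir) or not os.listdir(index_dir):
        problems.append("No index files exist inside folder")
    # Row 6: no Pages element anywhere in the batch XML
    # (assumption: the element is named "Pages" and has no namespace).
    root = ET.fromstring(batch_xml_text)
    if next(root.iter("Pages"), None) is None:
        problems.append("No pages found in batch XML.")
    return problems


# Example: an empty index folder and a batch XML without a Pages element.
with tempfile.TemporaryDirectory() as empty_index_dir:
    found = diagnose(empty_index_dir, "<Batch><Documents/></Batch>")
print(found)
```

The remaining rows involve the index contents or the database and cannot be checked this simply.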