In our tests, we found that improving search classification is the most effective way to improve classification results. Search Classification is much more accurate than Image Classification. Image Classification is recommended only for single-page documents and only when you have a small number of document types.
Key-Value Extraction is a great way to improve your search classification results. Here is a quick run-through of how it works: during Page Processing, you can configure Key-Value extraction using the KV-Page Processing plugin. This plugin extracts Key-Value pairs and places them as page-level fields on each page. You can then write a script that compares the extracted key value with the classification result. For example, if classification identified the document as a Tax form but your page-level extraction found W-2, the script can reclassify (rename) the document type in the batch XML file.
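The comparison step described above can be sketched roughly as follows. This is a minimal illustration in Python that works on a simplified batch XML string, not an actual Ephesoft script. The element names (`Document`, `Type`, `PageLevelField`, `Name`, `Value`), the field name `FormType`, and the `REFINEMENTS` mapping are all assumptions made for the example, so check them against your own batch XML file before adapting the idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: (classified type, extracted key value) -> corrected type.
REFINEMENTS = {("Tax form", "W-2"): "W-2"}


def reclassify(batch_xml, field_name="FormType"):
    """Rename document types when a page-level field contradicts them.

    Element and field names are assumed for illustration only.
    """
    root = ET.fromstring(batch_xml)
    for doc in root.iter("Document"):
        type_el = doc.find("Type")
        if type_el is None:
            continue
        # Look through every page-level field extracted for this document.
        for field in doc.iter("PageLevelField"):
            if field.findtext("Name") != field_name:
                continue
            value = (field.findtext("Value") or "").strip()
            new_type = REFINEMENTS.get((type_el.text, value))
            if new_type:
                type_el.text = new_type  # reclassify (rename) the document
                break
    return ET.tostring(root, encoding="unicode")
```

Documents whose classified type and extracted value do not appear together in the mapping are left unchanged, so the sketch only overrides classification in the cases you list explicitly.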
If you would rather not work with custom scripts and Key-Value enhanced classification, review the best practices below to improve your classification results.
Here are tips you can use to improve search classification. Above all, make sure the correct samples are in the correct document folders. The points below focus on how to manage your sample set of documents:
- Watch the following video. We have a tool designed to show you samples that are placed in the wrong folders or samples that are too similar to each other, so that you can eliminate the incorrect samples. http://www.youtube.com/watch?v=5MoG_UzuH4M&feature=plcp
- Do NOT include more than one sample if the content is the same. Less is better. If you have two ABC Document samples and 80% of the content is the same, remove the redundant sample.
- Do not use illegible samples. If Ephesoft can’t read them, it cannot learn from them. A great way to train a document type is to use a blank version of the form, one that does not have any metadata on it. Most of our customers download blank forms such as HUD-1, Appraisal, Loan Application, and Tax forms from the internet. Always try to train on 200-300 dpi TIFF images.
- If you are not sure whether the last page will always contain the same information, you can put all the middle pages into the middle page folder and leave the last page folder empty.
- Another way to improve results is to edit the HOCR.xml files that Ephesoft learned from. By removing misleading items and relearning the samples, you get a more specialized set of information.
- If you are seeing low confidence scores, you can change the settings in the Document Assembler plugin.
- If you are seeing pages being attached to other documents, that means you do not have any samples for those pages. You can also reduce the middle page scoring in the Search Classification plugin.
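The redundancy rule in the list above (remove a sample when about 80% of its content matches another sample) can be checked roughly with a short script. The sketch below is only an approximation: it assumes you have already exported the OCR text of each sample as a plain string, it uses Python's standard `difflib` word-sequence ratio as a stand-in for "percent of content that is the same", and the function and sample names are hypothetical. It is not the tool shown in the video.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Tuple


def similarity(text_a: str, text_b: str) -> float:
    """Word-level similarity in [0, 1] between two OCR text dumps."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    return SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()


def find_redundant_samples(
    samples: Dict[str, str], threshold: float = 0.8
) -> List[Tuple[str, str, float]]:
    """Return (name_a, name_b, score) for each pair at or above the threshold."""
    redundant = []
    # Compare every pair of samples once.
    for (name_a, text_a), (name_b, text_b) in combinations(samples.items(), 2):
        score = similarity(text_a, text_b)
        if score >= threshold:
            redundant.append((name_a, name_b, round(score, 2)))
    return redundant
```

For each reported pair, keeping only one of the two samples follows the "less is better" advice. The 0.8 threshold mirrors the 80% figure above and can be adjusted.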
Using these best practices can greatly improve your classification results.