What’s New In Transact 4.5?
Machine Learning | Machine Learning for Tables
In previous versions of Transact, table extraction was based solely on user-defined table extraction rules. Whenever table data was not extracted or was incomplete, the user had to manually change or add the data on the Validation screen for every batch.
Now, the system can capture and learn the characteristics of the table you define on the Validation screen. This information is then used to automatically extract tabular data in further batches. To implement this feature, the Learn Table button has been added to the Validation screen and a new Machine Learning Based Table Extraction Switch has been included in the Machine Learning Based Extraction plugin.
Machine learning for tables works in both of the following cases:
- When the user defines table extraction rules, but they fail to produce the required results.
- When the user does not define any table extraction rule and specifies only the table columns.
1. Make sure that the Table Extraction plugin and the Machine Learning Based Extraction plugin are added to the Extraction Module.
2. On the Table Extraction Plugin screen, the Table Extraction Switch should be turned ON.
3. On the Machine Learning Based Extraction Plugin screen, both switches (Machine Learning Based Extraction Switch and Machine Learning Based Table Extraction Switch) should be turned ON.
Note: If you want to enable machine learning only for tables, you can keep the Machine Learning Based Extraction Switch OFF.
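The prerequisites above can be summarized as a small checklist. The following Python sketch is only an illustration: the `ExtractionModuleConfig` class and its field names are hypothetical and are not part of any Transact API. It shows which settings are required for table learning and that the Machine Learning Based Extraction Switch is optional when only tables are learned.

```python
from dataclasses import dataclass


@dataclass
class ExtractionModuleConfig:
    """Hypothetical snapshot of the relevant settings (not a Transact API)."""
    table_extraction_plugin_added: bool
    ml_extraction_plugin_added: bool
    table_extraction_switch: bool
    ml_extraction_switch: bool
    ml_table_extraction_switch: bool


def table_learning_problems(cfg):
    """Return a list of unmet prerequisites for table machine learning."""
    problems = []
    if not cfg.table_extraction_plugin_added:
        problems.append("Table Extraction plugin is not added")
    if not cfg.ml_extraction_plugin_added:
        problems.append("Machine Learning Based Extraction plugin is not added")
    if not cfg.table_extraction_switch:
        problems.append("Table Extraction Switch is OFF")
    if not cfg.ml_table_extraction_switch:
        problems.append("Machine Learning Based Table Extraction Switch is OFF")
    # cfg.ml_extraction_switch is deliberately not checked: it may stay OFF
    # when machine learning is wanted only for tables.
    return problems


tables_only = ExtractionModuleConfig(True, True, True, False, True)
print(table_learning_problems(tables_only))
```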
1. On the Batch Class Management screen, open or create a Batch Class.
2. Add a Document Type and upload the Learn file, the Test Classification file, and the Test Extraction file.
3. Assign the roles for machine learning classification and extraction.
Note: Roles for machine learning are assigned to enable the operators to use machine learning capabilities. If no roles are assigned at the Document Type level, it is assumed that machine learning is enabled only for super admin.
4. Navigate to the Tables screen and add a table.
5. Navigate to the Table Columns screen and add table columns.
6. Go to the Upload Batch screen, upload the file, select the Batch Class and priority, and click Start Batch.
7. The batch processing stops at the Validation stage. On the Batch Instance Management screen, select the batch and click Open.
8. On the Validation screen, click on the Table button on the top middle panel. The Table view opens. Since no table extraction rule has been defined, the table does not contain any data.
The Table view now includes new buttons for table machine learning.
|Button|Description|
|Insert|Use it to add rows to the table.|
|Delete|Use it to delete selected rows from the table.|
|Delete All|Use it to delete all rows.|
|Row Extraction|Use it to extract data for the entire table. Once you click it, all the buttons on the panel become unavailable except Start Extraction and Reject Extraction. A new row is added to the table, which is to be manually populated with values. This option is used to set a table extraction rule.|
|Column Extraction|Use it to extract data for a specific column. Once you click it, all the buttons on the panel become unavailable except Start Extraction and Reject Extraction.|
|Start Extraction|Use it to start the extraction process after selecting Row Extraction or Column Extraction and providing the corresponding table values.|
|Reject Extraction|Use it to stop the extraction process after selecting Row Extraction or Column Extraction. Once you click it, all entered values are deleted and all the buttons on the panel become available again except Start Extraction and Reject Extraction.|
|Use AutoExtraction|Use it to automatically locate table structures on a page and extract tabular data without creating extraction rules. Extracted data can then be filtered manually.|
|Learn Table|Use it to save machine learned data after completing the table configuration.|
9. Select Row Extraction to extract data for the entire table. A new row is added to the table.
10. Select the values for all columns:
- Place your cursor in the text box of a column to be learned.
- Create a corresponding overlay on the image by clicking the required value (right-click to create a custom overlay over multiple values).
11. Click Start Extraction. The table is populated with extracted values.
Note: For successful table data extraction, values have to be provided for all the columns; otherwise, an error message is displayed.
12. If required, delete unnecessary rows by selecting the rows and clicking Delete.
The final table will look like the following:
Whenever required, you can use Column Extraction to change the content of a specific column. To do so, click Column Extraction, place the cursor in the text field of the corresponding column, and draw an overlay on the image to define a new value.
Click Start Extraction. The column is updated with newly selected values.
13. After the table data is extracted and edited, click Learn Table. Machine learning is confirmed by the message “Learning updated successfully”.
14. Click Validate to save the results. The extracted data is learned and will be used in the processing of further batches for this Document Type.
Note: The same mechanism for table machine learning is applied if the user defines table extraction rule(s) but table data is not extracted or is incomplete.
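The requirement from step 11, that a value is provided for every column before Start Extraction is clicked, can be pictured with a short check. This is a minimal Python sketch of the idea only; the function name, the column names, and the dictionary of selected values are hypothetical and do not reflect how the Validation screen is implemented.

```python
def missing_columns(table_columns, selected_values):
    """Return the names of columns that have no non-blank value in the sample row.

    table_columns: ordered list of column names defined for the table.
    selected_values: mapping of column name -> value picked from the image.
    """
    return [
        name
        for name in table_columns
        if not (selected_values.get(name) or "").strip()
    ]


# Hypothetical table with three columns; only one has a usable value.
columns = ["Item", "Quantity", "Price"]
sample_row = {"Item": "Widget", "Quantity": "  "}

unfilled = missing_columns(columns, sample_row)
if unfilled:
    print("Provide values for:", ", ".join(unfilled))
```

An empty result means the sample row is complete and extraction could proceed.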
The table learning data will be kept in temporary files until the user finalizes the extracted values and clicks Validate on the Validation screen. Once the batch crosses the Validation stage, the learning data will be merged with the batch class learning as described below.
Let us process another batch of the same Batch Class. Since the extracted data is already learned, all further batches will be processed successfully through to the Finished state and will not require any intervention from the operator.
However, to check the extraction results, we can intentionally stop the batch processing at the Validation stage. To do so, let us go back to the Index Fields screen and use the Force Review option from the Additional Configurations dropdown for one of the index fields.
Let us start the batch processing by uploading the file on the Upload Batch screen and clicking Start Batch.
The processing stops at the Validation stage. On the Batch Instance Management screen, select the batch and click Open.
On the Validation screen, both the index field data and the table data are extracted successfully on the basis of machine learned information.
The table data will be extracted on the basis of machine learned information in all further batches even if the Table Extraction switch is turned OFF in the Extraction module.
1. Table machine learning can be used in conjunction with rule-based table extraction, i.e. when the user defines table extraction rules but then changes the extracted data and learns the new values at the time of validation. In this case, the machine-learning algorithm updates and saves the latest learned data once the batch is validated. The next time files of the same Batch Class and Document Type are processed, the system extracts the values using both the defined table extraction rules and the machine-learned data.
2. Column data extracted with a confidence lower than the defined threshold is marked in red on the Validation screen.
Once the user validates the extracted value (either as is or changed), the data is automatically machine-learned and saved.
3. Simultaneous learning from several UI servers is not possible: once a batch is opened for validation by one user, it is locked. If any other user tries to open the same batch, a notification message is displayed.
4. If no table data is extracted, the batch stops at validation for learning purposes only when the document contains an invalid Index Field (or an Index Field set for Force Review, see Batch Class -> Document Types -> Index Fields -> Additional Configuration dropdown) or when the document is set to invalid using scripts.
Suppose you want to use machine learning solely for table data extraction (without defining any table extraction rules and without extracting any index fields). In such cases, the application assumes that there is no data to be fetched, and the batch is processed all the way through to the Finished state without stopping at validation. To perform table machine learning in this situation, you can create a custom script that marks the batch as invalid if no data is extracted. The batch then stops at the validation stage, and you can configure a table to be machine learned on the Validation screen.
To do so, you can modify the default extraction script provided for each Batch Class in the UNC folder (created at the time of the Batch Class configuration).
In Ephesoft Transact, navigate to the Extraction module of your Batch Class and add the Extraction Scripting plugin. Make sure to place it after all other extraction plugins. Then click Apply and Deploy to save the changes.
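The logic such a script needs is small: find every document without extracted table rows and mark it as invalid. The sketch below illustrates only that logic, in Python, on a simplified XML layout. It is not the default extraction script shipped with a Batch Class, and the element names (`Document`, `Valid`, `Row`) and the function name are assumptions made for the example; check them against the batch XML of your own installation before adapting the idea.

```python
import xml.etree.ElementTree as ET


def mark_documents_without_rows_invalid(batch_xml):
    """Set <Valid> to "false" for every document that has no table rows.

    Assumes (hypothetically) that each <Document> carries a <Valid> flag
    and that extracted table rows appear as <Row> elements beneath it.
    """
    root = ET.fromstring(batch_xml)
    for doc in root.iter("Document"):
        has_rows = next(doc.iter("Row"), None) is not None
        if not has_rows:
            valid = doc.find("Valid")
            if valid is None:
                valid = ET.SubElement(doc, "Valid")
            valid.text = "false"
    return ET.tostring(root, encoding="unicode")


# Simplified, assumed batch structure with two documents.
SAMPLE = (
    "<Batch><Documents>"
    "<Document><Identifier>DOC1</Identifier><Valid>true</Valid>"
    "<DataTables><DataTable><Rows/></DataTable></DataTables></Document>"
    "<Document><Identifier>DOC2</Identifier><Valid>true</Valid>"
    "<DataTables><DataTable><Rows><Row/></Rows></DataTable></DataTables></Document>"
    "</Documents></Batch>"
)

updated = ET.fromstring(mark_documents_without_rows_invalid(SAMPLE))
flags = {
    d.findtext("Identifier"): d.findtext("Valid")
    for d in updated.iter("Document")
}
print(flags)
```

In this sample, only the document without rows is flagged, so a batch that already contains table data is left untouched.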
Information about machine learned tables is stored in the table-ml-configuration subfolder of the UNC folder created for each Batch Class. The location of the UNC folder is defined at the time of the Batch Class configuration.
The subfolder contains machine learned data for each table under a specific Document Type. Information about the table columns and rows is stored in two separate folders: column-feature and row-feature.
The column-feature folder contains a JSON file with information related to machine learned table columns.
The row-feature folder contains an ARFF file with information related to rows. Rows deleted on the Validation screen are also included in this file and marked as invalid.
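To inspect a row-feature file, it helps to know that ARFF is a plain-text format with an `@attribute` header followed by comma-separated rows after `@data`. The following Python sketch reads such a file in its simplest form (no quoted names, no sparse rows). The attribute names and values in the sample are invented for illustration; the actual attributes written by Transact are not documented here.

```python
def read_arff(text):
    """Parse a simple ARFF document into (attribute names, list of row dicts)."""
    attributes = []
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            rows.append(dict(zip(attributes, values)))
        elif line.lower().startswith("@attribute"):
            # "@attribute <name> <type>": keep only the name
            attributes.append(line.split(None, 2)[1])
        elif line.lower().startswith("@data"):
            in_data = True
    return attributes, rows


# Hypothetical content: two numeric features and a valid/invalid label.
SAMPLE_ARFF = """\
% invented example, not real Transact output
@relation rows
@attribute line_length numeric
@attribute digit_ratio numeric
@attribute label {valid,invalid}
@data
42,0.35,valid
17,0.00,invalid
"""

attrs, data_rows = read_arff(SAMPLE_ARFF)
invalid_rows = [r for r in data_rows if r["label"] == "invalid"]
print(attrs, len(data_rows), len(invalid_rows))
```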
On the basis of the information contained in the row-feature file, the system creates a randomForest model, which is then used for processing of all further batches.
Suppose we have configured a table containing ten rows. The system creates a randomForest model for the table and applies it to all further batches with 10-row tables. If the user updates the table and, for example, adds five more rows, the randomForest model is updated as well to include the latest information. The next time a batch is executed, 15 rows of information are extracted, and so on.
Copy/Import/Export/Delete Operations at the Batch Class, Document Type and Table Level
Whenever the Batch Class, Document Type, or Table is copied, imported, exported, or deleted, the learning files for tables are copied, imported, exported, or deleted as well.
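At the file level, carrying the learning along with a copied Batch Class amounts to duplicating its table-ml-configuration subfolder. The Python sketch below illustrates that idea on temporary folders. It is an assumption-laden illustration, not a description of how Transact performs the operation: the function name, the folder names under the subfolder, and the file name are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

ML_SUBFOLDER = "table-ml-configuration"  # subfolder name given in the text


def copy_table_learning(src_batch_class_dir, dst_batch_class_dir):
    """Copy the table learning subfolder from one Batch Class folder to another.

    Returns False when the source has no learning data, True after copying.
    """
    src = Path(src_batch_class_dir) / ML_SUBFOLDER
    dst = Path(dst_batch_class_dir) / ML_SUBFOLDER
    if not src.is_dir():
        return False
    shutil.copytree(src, dst, dirs_exist_ok=True)  # requires Python 3.8+
    return True


# Demonstration on throwaway folders with an invented layout.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "BC1"
    target = Path(tmp) / "BC2"
    learned = source / ML_SUBFOLDER / "Invoice" / "row-feature" / "rows.arff"
    learned.parent.mkdir(parents=True)
    learned.write_text("@relation rows\n@data\n")
    target.mkdir()

    copied = copy_table_learning(source, target)
    copy_exists = (
        target / ML_SUBFOLDER / "Invoice" / "row-feature" / "rows.arff"
    ).is_file()
    nothing_to_copy = copy_table_learning(target / "missing", target)

print(copied, copy_exists, nothing_to_copy)
```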
Note: Table machine learning support is not provided for two-column layout tables and horizontal tables.