Resume Service runs on every node of a multi-clustered environment. When a node goes down, this service is responsible for resuming the following on behalf of that node:
- RUNNING batches of the down node.
- Single node services (Folder Monitor, PickUp Service, Application Script, etc.) owned by the down node.
- Auto-restarting batches that errored due to a network glitch.
Because this service runs on all nodes, there were cases under heavy load in which more than one node worked on resuming the same item, even though their cron jobs were different. To handle this scenario, a synchronization mechanism has been introduced at a very granular level, i.e., per batch. A resuming node now first takes ownership of a batch before resuming it. If another node already owns that batch, the resuming node skips it and moves on to the next batch.
The owning node can be read from the ‘resuming_server’ column of the ‘batch_instance’ table. This column is set back to NULL once the batch resumes execution.
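The per-batch ownership check can be illustrated with a minimal sketch. It is not the product's actual implementation: it assumes ownership is taken with a conditional UPDATE on ‘resuming_server’, uses an in-memory SQLite database in place of the real one, and the `id` column, the node names, and the helper functions `try_claim`, `release`, and `resume_batches` are hypothetical names chosen for illustration. Only the table name ‘batch_instance’ and the column ‘resuming_server’ come from the description above.

```python
import sqlite3


def try_claim(conn, batch_id, node):
    """Try to take ownership of a batch; return True only if this node got it."""
    # The WHERE clause makes the claim conditional: the row is updated only
    # while no other node is recorded in resuming_server.
    cur = conn.execute(
        "UPDATE batch_instance SET resuming_server = ? "
        "WHERE id = ? AND resuming_server IS NULL",
        (node, batch_id),
    )
    conn.commit()
    return cur.rowcount == 1


def release(conn, batch_id):
    """Clear the owner, mirroring the reset to NULL once the batch is running again."""
    conn.execute(
        "UPDATE batch_instance SET resuming_server = NULL WHERE id = ?",
        (batch_id,),
    )
    conn.commit()


def resume_batches(conn, node, batch_ids):
    """Resume every batch this node manages to claim; skip batches owned elsewhere."""
    resumed = []
    for batch_id in batch_ids:
        if not try_claim(conn, batch_id, node):
            continue  # another node owns this batch, move to the next one
        # Placeholder: the real resume work would happen here.
        resumed.append(batch_id)
    return resumed


def demo():
    """Two nodes compete for the same batch; only the first one claims it."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical, simplified schema: only the columns this sketch needs.
    conn.execute(
        "CREATE TABLE batch_instance (id TEXT PRIMARY KEY, resuming_server TEXT)"
    )
    conn.executemany(
        "INSERT INTO batch_instance (id) VALUES (?)", [("BI1",), ("BI2",)]
    )
    conn.commit()
    first = resume_batches(conn, "node-a", ["BI1"])
    second = resume_batches(conn, "node-b", ["BI1", "BI2"])
    return conn, first, second
```

In this sketch, `demo()` lets a hypothetical `node-a` claim `BI1` first, so `node-b` skips `BI1` and resumes only `BI2`. Expressing the check and the claim as a single conditional UPDATE is one common way to avoid two nodes passing a separate "is it free?" check at the same moment; whether the product does it this way is an assumption.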
Also see: Auto restart network error batches