I am working on a batch processing problem. The solution needs to handle failing hardware.
There is a master node (which initiates task execution) and worker nodes, which execute the jobs. I know how failover of worker nodes works, but I could not find any information about failover of master nodes. Whenever the master node that started a task fails, the whole task is canceled.
Is there any way to finish processing the task in that case?
Could you suggest the best way to implement failover of the master node?
Kind Regards,
Kuba
Whenever your master node dies, there is basically no one left to perform the "reduce" step of your MapReduce task.
There are several ways you can try to mitigate this problem:
1. Save intermediate checkpoints using GridCheckpointSpi (GridTaskSession.saveCheckpoint(..) API). When your task restarts after a node crash, you can check whether a checkpoint was saved and resume from it.
2. Do the same as in (1), but use the data grid instead (GridCache API).
3. If you don't care about "reduce", have your jobs ignore the "cancel" call and just save their results in the data grid when they are done.
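To illustrate the checkpoint idea behind the first two options, here is a rough, self-contained sketch in plain Java. It does not call the GridGain API: the Map is only a hypothetical stand-in for whatever store survives the master crash (checkpoint SPI or data grid), and CheckpointedReduce, reduce, failAfter and the two key names are made up for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CheckpointedReduce {
    // Hypothetical keys under which progress is saved in the external store.
    static final String NEXT_KEY = "next-index";
    static final String SUM_KEY = "partial-sum";

    /**
     * Sums job results, saving progress to the store after each one.
     * If failAfter >= 0, a crash is simulated once that many results
     * have been processed in the current run.
     */
    static int reduce(List<Integer> jobResults, Map<String, Integer> store, int failAfter) {
        // Resume from the last checkpoint if one exists, otherwise start fresh.
        int next = store.getOrDefault(NEXT_KEY, 0);
        int sum = store.getOrDefault(SUM_KEY, 0);
        int processed = 0;
        for (int i = next; i < jobResults.size(); i++) {
            if (failAfter >= 0 && processed == failAfter) {
                throw new IllegalStateException("simulated master crash");
            }
            sum += jobResults.get(i);
            // Checkpoint: record the partial result and where to resume.
            store.put(SUM_KEY, sum);
            store.put(NEXT_KEY, i + 1);
            processed++;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> store = new HashMap<>(); // stand-in for a store that outlives the master
        List<Integer> results = List.of(1, 2, 3, 4);
        try {
            reduce(results, store, 2); // first attempt dies after two results
        } catch (IllegalStateException e) {
            System.out.println("crashed, partial sum = " + store.get(SUM_KEY));
        }
        // The restarted task picks up from the checkpoint instead of starting over.
        System.out.println("final sum = " + reduce(results, store, -1));
    }
}
```

In a real setup the store has to live outside the master node, otherwise the checkpoint is lost together with it.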
--Best