This is a fundamental AWS Sagemaker question. When I run training with one of Sagemaker's built in algorithms I am able to take advantage of the massive speedup from distributing the job to many instances by increasing the instance_count argument of the training algorithm. However, when I package my own custom algorithm then increasing the instance count seems to just duplicate the training on every instance, leading to no speedup.
I suspect that when I am packaging my own algorithm there is something special I need to do to control how it handles the training differently for a particular instance inside of the my custom train() function (otherwise, how would it know how the job should be distributed?), but I have not been able to find any discussion of how to do this online.
Does anyone know how to handle this? Thank you very much in advance.
Specific examples: => It works well in a standard algorithm: I verified that increasing train_instance_count in the first documented sagemaker example speeds things up here: https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-train-model-create-training-job.html
=> It does not work in my custom algorithm. I tried taking the standard sklearn build-your-own-model example and adding a few extra sklearn variants inside of the training and then printing out results to compare. When I increase the train_instance_count that is passed to the Estimator object, it runs the same training on every instance, so the output gets duplicated across each instance (the printouts of the results are duplicated) and there is no speedup. This is the sklearn example base: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb . The third argument of the Estimator object partway down in this notebook is what lets you control the number of training instances.