AWS Sagemaker custom user algorithms: how to take

This is a fundamental AWS Sagemaker question. When I run training with one of Sagemaker's built in algorithms I am able to take advantage of the massive speedup from distributing the job to many instances by increasing the instance_count argument of the training algorithm. However, when I package my own custom algorithm then increasing the instance count seems to just duplicate the training on every instance, leading to no speedup.

I suspect that when I am packaging my own algorithm there is something special I need to do to control how it handles the training differently for a particular instance inside of the my custom train() function (otherwise, how would it know how the job should be distributed?), but I have not been able to find any discussion of how to do this online.

Does anyone know how to handle this? Thank you very much in advance.

Specific examples: => It works well in a standard algorithm: I verified that increasing train_instance_count in the first documented sagemaker example speeds things up here: https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-train-model-create-training-job.html

=> It does not work in my custom algorithm. I tried taking the standard sklearn build-your-own-model example and adding a few extra sklearn variants inside of the training and then printing out results to compare. When I increase the train_instance_count that is passed to the Estimator object, it runs the same training on every instance, so the output gets duplicated across each instance (the printouts of the results are duplicated) and there is no speedup. This is the sklearn example base: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb . The third argument of the Estimator object partway down in this notebook is what lets you control the number of training instances.

Distributed training requires having a way to sync the results of the training between the training workers. Most of the traditional libraries, such as scikit-learn are designed to work with a single worker, and can't just be used in a distributed environment. Amazon SageMaker is distributing the data across the workers, but it is up to you to make sure that the algorithm can benefit from the multiple workers. Some algorithms, such as Random Forest, are easier to take advantage of the distribution, as each worker can build a different part of the forest, but other algorithms need more help.

Spark MLLib has distributed implementations of popular algorithms such as k-means, logistic regression, or PCA, but these implementations are not good enough for some cases. Most of them were too slow and some even crushed when a lot of data was used for the training. The Amazon SageMaker team reimplemented many of these algorithms from scratch to benefit from the scale and economics of the cloud (20 hours of one instance costs the same as 1 hour of 20 instances, just 20 times faster). Many of these algorithms are now more stable and much faster beyond the linear scalability. See more details here: https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html

For the deep learning frameworks (TensorFlow and MXNet) SageMaker is using the built-in parameters server that each one is using, but it is taking the heavy lifting of the building the cluster and configuring the instances to communicate with it.